Population Stratification --pca finds all zero eigenvalues

Pietro Biroli

Aug 9, 2014, 2:42:19 PM
to plink2...@googlegroups.com
Dear All,

I have been trying to construct the top 20 principal components of the variance-standardized relationship matrix in order to control for population stratification in my analysis.
I am using the offspring cohort of the Framingham Heart Study, and I am running the following command:

plink --bfile FHS_offspring --pca header --out FHS_pca

The procedure seems to run smoothly and no error message appears; however, all the eigenvalues and eigenvectors are equal to zero.
I tried some variations on this basic command (adding "--geno 0.2"; manually removing all the individuals with more than 50% of genotypes missing; constructing clusters using the whole family structure and then controlling for "--pca-clusters FHScluster.clustrer1 --family"), but nothing seems to work: I still get eigenvalues equal to zero.
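
A minimal sketch for confirming the symptom described above, assuming the .eigenval file holds one eigenvalue per line with no header. The function name count_nonzero_eigenvalues is hypothetical, not part of PLINK.

```shell
# Print how many entries in a PLINK .eigenval file are non-zero.
# Assumption: one eigenvalue per line, no header line.
count_nonzero_eigenvalues() {
  awk 'NF > 0 && $1 + 0 != 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical usage, with the output prefix from the command above:
#   count_nonzero_eigenvalues FHS_pca.eigenval
```

A result of 0 means every eigenvalue in the file is zero, which is the failure reported here.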

Any help would be greatly appreciated, thanks!
Pietro

Christopher Chang

Aug 9, 2014, 4:27:03 PM
to plink2...@googlegroups.com
Hmm, the last time I saw this happen, it was due to the presence of NA entries in the GRM.

* What does your GRM look like?
* If you use GCTA's --make-grm/--pca commands, do you also see the same problems?
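
One way to answer the first question is to scan the gzipped GRM for entries that are not finite numbers. The sketch below assumes the four-column GCTA-style text layout (two sample indices, number of non-missing variants, relationship value); count_nonfinite_grm is a hypothetical helper name.

```shell
# Count GRM rows whose relationship value (column 4) is inf, nan, or NA.
# Assumption: columns are index1, index2, n_nonmissing_variants, value.
count_nonfinite_grm() {
  gzip -dc "$1" | awk 'tolower($4) ~ /inf|nan|^na$/ { n++ } END { print n + 0 }'
}

# Hypothetical usage on a file written by --make-grm-gz:
#   count_nonfinite_grm FHS_pca.grm.gz
```

Any count above 0 points to the kind of GRM entry suspected here.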

Pietro Biroli

Aug 11, 2014, 12:04:26 PM
to plink2...@googlegroups.com
Hi Christopher,

Thanks for the reply. I hadn't calculated the GRM before, but now that I have, using --make-grm-gz, I see that there are some "inf" values in the resulting .grm.gz file.
However, even if I put --make-grm-gz and --pca on the same command line, the eigenvalues are still all zero.

Thank you so much for your help,
best
Pietro


PS
If I write the following command:

$ plink --bfile FHS_offspring --make-grm --pca --out FHS_prova

I get this log file

Random number seed: 1407772682
516862 MB RAM detected; reserving 40000 MB for main workspace.
500568 variants loaded from .bim file.
3742 people (1780 males, 1962 females) loaded from .fam.
Using up to 63 threads (change this with --threads).
Calculating allele frequencies... done.
Warning: 453 het. haploid genotypes present (see FHS_prova.hh ).
Total genotyping rate is 0.984419.
500568 variants and 3742 people pass filters and QC.
Note: No phenotypes present.
Excluding 9868 variants on non-autosomes from relationship matrix calc.
Relationship matrix calculation complete.
Relationship matrix written to FHS_prova.grm.gz , and IDs written to
--pca: Results saved to FHS_prova.eigenval and FHS_prova.eigenvec .

Christopher Chang

Aug 11, 2014, 1:37:18 PM
to plink2...@googlegroups.com
Okay, the problem is definitely caused by the 'inf' GRM values.

Would it be possible for you to find a chromosome where you get 'inf' GRM values, and send me data from just that chromosome to test with?  If not, the next things to try are:
* Does "--maf 0.01" make the inf values go away?
* Does "--threads 1" make them go away?  (Yes, you'll only want to run this on one chromosome if possible...)
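
To preview how many variants a "--maf 0.01" filter would drop, one option is to inspect a frequency report first. This is a sketch only: it assumes a .frq file from "plink --freq" with the header CHR SNP A1 A2 MAF NCHROBS, and count_low_maf is a hypothetical helper name.

```shell
# Count variants in a PLINK .frq file with MAF below a threshold (default 0.01).
# Assumption: header line present, MAF in column 5; NA frequencies are counted too.
count_low_maf() {
  awk -v thr="${2:-0.01}" \
    'NR > 1 && ($5 == "NA" || $5 + 0 < thr + 0) { n++ } END { print n + 0 }' "$1"
}

# Hypothetical usage (FHS_freq is an illustrative output prefix):
#   plink --bfile FHS_offspring --freq --out FHS_freq
#   count_low_maf FHS_freq.frq 0.01
```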

Pietro Biroli

Aug 11, 2014, 5:44:46 PM
to plink2...@googlegroups.com
Dear Christopher,
I reran the command with --maf 0.01 and it worked: the GRM values are all finite numbers now, and I get eigenvalues and eigenvectors that make sense.
Thanks so much!
Pietro

Lilian Antunes

Apr 25, 2016, 2:13:33 PM
to plink2-users
I am running into the same problem. The GRM looks normal (all values are numeric), and I've tried what you suggested here, but I still get eigenvectors that are all zeros.

 plink_format]$ plink1.9 --bfile 490WES_PCA.in_plink --pca header tabs --threads 1 --out 490WES_PCA
PLINK v1.90b3r 64-bit (13 Jun 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 490WES_PCA.log.
Options in effect:
  --bfile 490WES_PCA.in_plink
  --out 490WES_PCA
  --pca header tabs
  --threads 1

128934 MB RAM detected; reserving 64467 MB for main workspace.
325015 variants loaded from .bim file.
908 people (214 males, 204 females, 490 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 490WES_PCA.nosex .
Using 1 thread.
Before main variant filters, 801 founders and 107 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.481953.
325015 variants and 908 people pass filters and QC.
Note: No phenotypes present.
Relationship matrix calculation complete.
--pca: Results saved to 490WES_PCA.eigenval and 490WES_PCA.eigenvec .


Any suggestions?

thanks,

-lili

Christopher Chang

Apr 26, 2016, 12:10:47 PM
to plink2-users
That's a very high missing call rate (over 50%).  I'm guessing that at least one of the GRM entries is inf or nan, and that's enough to cause the PCA routine to fail.

Try filtering out poorly-genotyped and low-MAF variants with "plink1.9 --bfile 490WES_PCA.in_plink --geno 0.1 --maf 0.01 --make-bed --out 490_WES_filtered", and then see if "plink1.9 --bfile 490_WES_filtered --pca header tabs --mind 0.1 --out 490WES_PCA" works.
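
Before filtering, the missingness reports from "--missing" show how much a 0.1 cutoff would remove. A rough sketch, assuming the standard layouts where F_MISS is column 6 of .imiss (per sample) and column 5 of .lmiss (per variant); count_high_missing and the 490WES_miss prefix are hypothetical names.

```shell
# Count rows whose missing rate exceeds a threshold (default 0.1).
# $1 = report file, $2 = 1-based column holding F_MISS, $3 = threshold.
# Assumption: the file has a single header line.
count_high_missing() {
  awk -v col="$2" -v thr="${3:-0.1}" \
    'NR > 1 && $col + 0 > thr + 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical usage:
#   plink1.9 --bfile 490WES_PCA.in_plink --missing --out 490WES_miss
#   count_high_missing 490WES_miss.imiss 6   # samples above 10% missing
#   count_high_missing 490WES_miss.lmiss 5   # variants above 10% missing
```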

Lilian Antunes

Apr 26, 2016, 4:22:04 PM
to plink2-users
Awesome!!! That worked.

Thanks for your help.

-lili

Jay Vaidya

Aug 19, 2016, 8:23:20 PM
to plink2-users
I am calculating the PCs for Framingham's 500K genotype data.

I tried this filtering approach as well (checking that the filtered .bed file was indeed smaller, as a sanity check) and am still getting all zeros in the PCA eigenvector file. Any further pointers will be gratefully accepted.

plink --bfile  fhs_500kg  --geno 0.1 --maf 0.01 --make-bed --out fhs_500kgwfilt
plink --bfile fhs_500kgwfilt --pca header tabs --mind 0.1 --out fhs_500kgwfilt_pca

Start time: Fri Aug 19 12:31:57 2016

Random number seed: 1471624317
258452 MB RAM detected; reserving 129226 MB for main workspace.
446108 variants loaded from .bim file.
9224 people (4247 males, 4977 females) loaded from .fam.
258 people removed due to missing genotype data (--mind).
IDs written to fhs_500kgwfilt_pca.irem .

Using up to 63 threads (change this with --threads).
Before main variant filters, 2001 founders and 6965 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.986852.
446108 variants and 8966 people pass filters and QC.
Note: No phenotypes present.
Excluding 8391 variants on non-autosomes from relationship matrix calc.
Relationship matrix calculation complete.
--pca: Results saved to fhs_500kgwfilt_pca.eigenval and
fhs_500kgwfilt_pca.eigenvec 