Issue with PCA

Manyan Huang

Jun 22, 2021, 4:55:45 PM
to plink2-users
Dear all,

I have run into an issue. I want to run PCA on a subset of samples, using the command below.

plink --bfile $bf --keep $s --pca --allow-no-sex --out /geode2/home/u030/huanshan/Carbonate/manyan/GbyG/Rochy_application/PCA/PCA_rep/pca_rep

However, the job keeps running and never finishes. The last output before it gets stuck is:

.........
3888000 markers complete.
3888045 markers complete.
Relationship matrix calculation complete.
[extracting eigenvalues and eigenvectors]

Any help or suggestion would be highly appreciated.

Thank you,
Manyan





Christopher Chang

Jun 22, 2021, 5:08:19 PM
to plink2-users
If you run with --debug, what's the top of the resulting .log file?

Manyan Huang

Jun 22, 2021, 5:19:49 PM
to plink2-users
Thank you very much for your reply! Here is the resulting .log file

PLINK v1.90b6.5 64-bit (13 Sep 2018)
Options in effect:
  --allow-no-sex
  --bfile ....../Rocky/aim1aim2.ACGT.final
  --debug
  --keep ......../PCA_rep/all_rep_id_in_fam
  --out ......../PCA_rep/pca_rep
  --pca 10

Hostname: c48
Working directory: ....../PCA/PCA_rep
Start time: Tue Jun 22 17:11:15 2021

Random number seed: 1624396275
257087 MB RAM detected; reserving 128543 MB for main workspace.
Allocated 96407 MB successfully, after larger attempt(s) failed.
3998891 variants loaded from .bim file.
6485 people (2451 males, 4033 females, 1 ambiguous) loaded from .fam.
Ambiguous sex ID written to
/geode2/home/u030/huanshan/Carbonate/manyan/GbyG/Rochy_application/PCA/PCA_rep/pca_rep.nosex
.
4142 phenotype values loaded from .fam.
--keep: 2507 people remaining.
Using up to 23 threads (change this with --threads).
Before main variant filters, 1517 founders and 990 nonfounders present.
Calculating allele frequencies... done.
Warning: 1508544 het. haploid genotypes present (see
/geode2/home/u030/huanshan/Carbonate/manyan/GbyG/Rochy_application/PCA/PCA_rep/pca_rep.hh
); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.856529.
3998891 variants and 2507 people pass filters and QC.
Among remaining phenotypes, 1462 are cases and 1045 are controls.
Excluding 110846 variants on non-autosomes from relationship matrix calc.


Christopher Chang

Jun 22, 2021, 5:34:23 PM
to plink2-users
1. Are you running on a shared compute cluster?  If yes, does this still get stuck if you add "--memory 4000"?
2. If that doesn't fix the problem, what happens if you run with a newer plink 1.9 build?  (I would not expect this to make a difference, but there has been a small tweak to how plink calls linear algebra libraries so that results are more consistent across machines, so, better safe than sorry.)
3. If that also gets stuck, is it possible for you to post a dataset that I can replicate the issue with?  It could contain fewer samples and/or variants, as long as you still see this command getting stuck.
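
A minimal sketch of suggestion 1, for readers who want to try it. The function name and its argument order are hypothetical; `--pca 10` and `--allow-no-sex` are taken from the log above, and 4000 MB is the value suggested in this reply.

```shell
# Sketch only: rerun the PCA with a capped main workspace.
# pca_with_memory_cap is a hypothetical helper, not part of plink.
#   $1 = --bfile prefix, $2 = --keep file, $3 = --out prefix,
#   $4 = workspace size in MB (defaults to the 4000 suggested above)
pca_with_memory_cap() {
  plink --bfile "$1" --keep "$2" --pca 10 --allow-no-sex \
        --memory "${4:-4000}" --out "$3"
}
```

It could be called as `pca_with_memory_cap "$bf" "$s" pca_rep_mem4000`, where `pca_rep_mem4000` is an arbitrary output prefix. For suggestion 2, the build in use is reported on the first line of the .log file.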

Hannah Trebes

Jul 4, 2021, 5:58:04 PM
to plink2-users
Hi all,

Off the back of this issue, I've been "stuck" at this stage [extracting eigenvalues and eigenvectors] for a few days but the program appears to still be performing the calculations it says it is (looking at CPU and memory usage). 

If at all possible it would be great to get some kind of progress reporting at this stage as it appears there is nothing else until the program is finished.

Cheers.

Manyan Huang

Jul 4, 2021, 6:02:15 PM
to plink2-users
Hi Hannah,

In my case, the problem was caused by high correlation between samples. I am not sure your situation is the same as mine, but you could try separating the samples and then running PCA.

Best,
Manyan
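
The thread does not say how the samples were separated. As one possible reading, and purely as an assumption, the sketch below restricts the PCA to founders (the log above reports 1517 founders and 990 nonfounders) using plink 1.9's `--filter-founders`. The function name is hypothetical, and founder status depends on the parental IDs in the .fam file, so founders may still be related to each other.

```shell
# Sketch only: run PCA on founders, as one way of splitting related samples.
# pca_founders_only is a hypothetical helper, not part of plink.
#   $1 = --bfile prefix, $2 = --keep file, $3 = --out prefix
pca_founders_only() {
  plink --bfile "$1" --keep "$2" --filter-founders \
        --pca 10 --allow-no-sex --out "$3"
}
```

Example call: `pca_founders_only "$bf" "$s" pca_rep_founders`, with an arbitrary output prefix.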

Chris Chang

Jul 4, 2021, 6:19:46 PM
to Hannah Trebes, plink2-users
This is a single call to a linear-algebra-library function, so no direct progress indicator can be provided, sorry.

The usual reason for this operation to be slow is a large number (tens or hundreds of thousands) of samples.  Use plink 2.0 "--pca approx" for that.
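
A minimal sketch of what that might look like on the command line. The binary name `plink2`, the function name, and the choice of 10 PCs (borrowed from the log earlier in the thread) are assumptions; the approximate mode is aimed at very large sample counts, as noted above.

```shell
# Sketch only: approximate PCA with plink 2.0.
# pca_approx is a hypothetical helper; plink2 is assumed to be on PATH.
#   $1 = --bfile prefix, $2 = --keep file, $3 = --out prefix
pca_approx() {
  plink2 --bfile "$1" --keep "$2" --pca 10 approx --out "$3"
}
```

Example call: `pca_approx "$bf" "$s" pca_rep_approx`, with an arbitrary output prefix.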


krts

Jul 5, 2021, 2:03:22 AM
to plink2-users
Hi,

I am working with rare-variant exome data and have several questions.

1) I am having trouble updating the rs IDs for my exome variants. I build a key from chromosome number and position, compare it against all_phase3.bim, and try to pull the overlapping names from my data with an awk 'NR==FNR' command, but every variant ends up in the overlapping-SNP list. When I then extract with that list, I get the error: No variants remaining after --extract. How can I handle this?

2) When I try to check population stratification with KING, I get: No autosome genotypes available for KING inferences.

3) If I perform PCA using   plink --bfile mydata12 --pca 3   how do I identify outliers from plink.eigenval or plink.eigenvec? And what criteria can be used to extract the outlier SNPs?

4) Why do we perform this step? Is it for multiple-testing correction?
                plink --bfile mydata-12 --assoc --mperm 1000000 --out 1Mill_perm_result_sub

5) If I enrich the genotypes for a specific disease by removing unwanted SNPs, is there a chance of finding a novel association?