Hello,
I am attempting to characterize population substructure in a large dataset (152k individuals) with PLINK 2.
I ran a PCA with 120 GB of memory allocated and got the following error in the log:
PLINK v2.00aLM 64-bit Intel (5 Jun 2017)
www.cog-genomics.org/plink/2.0/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to pca_results.log.
Options in effect:
--bfile genome
--maf 0.10
--memory 120000
--out pca_results
--pca 5
Start time: Mon Jul 10 16:33:25 2017
128828 MB RAM detected; reserving 120000 MB for main workspace.
Using up to 16 threads (change this with --threads).
152727 samples (80986 females, 71741 males; 152727 founders) loaded from
/myrandomdirectory/plink/genome.fam.
847131 variants loaded from
/myrandomdirectory/plink/genome.bim.
Note: No phenotype data present.
152727 samples (80986 females, 71741 males; 152727 founders) remaining after
main filters.
Calculating allele frequencies... done.
518660 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
328471 variants remaining after main filters.
Error: Out of memory. The --memory flag may be helpful.
How much RAM should I expect to need to run PCA on a dataset of this size? If it is not feasible with PLINK 2.0, what tricks could I use to obtain the population substructure with PLINK? I have already restricted to MAF >= 0.10 in hopes of shrinking the dataset. Is there anything else I should do?
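For reference, my rough back-of-envelope estimate (assuming, and I'm not sure this is right, that the exact PCA materializes the full N x N sample-relationship matrix in double precision) already exceeds my 120 GB workspace:

```python
# Rough memory estimate for exact PCA, ASSUMING the full dense
# N x N relationship matrix is held in memory as 8-byte doubles.
n_samples = 152_727          # from the log: "152727 samples ... loaded"
bytes_per_double = 8

grm_bytes = n_samples ** 2 * bytes_per_double
grm_gib = grm_bytes / 1024 ** 3

print(f"Full GRM: {grm_gib:.0f} GiB")  # ~174 GiB, more than the 120 GB I reserved
```

If that assumption is even approximately correct, the matrix alone would not fit in my workspace, which would explain the out-of-memory error.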
Thank you for your help, much appreciated!
-Anna-