Out of memory error with --mds-plot, even when using --memory


Jiang

Jun 22, 2017, 5:19:07 PM
to plink2-users
Hi,

I'm trying to analyze a large dataset (>100K individuals, ~2K markers).

I have been specifying "--memory 3500" (the server limits me to less than 4 GB of RAM per job), and most PLINK functions work fine, but "--mds-plot" invariably fails with an "out of memory" error, even with "--memory" specified.

Below is the log file. Could you please provide some advice? Thanks

-------

PLINK v1.90b4.4 64-bit (21 May 2017)
Options in effect:
  --bfile theinput
  --cluster
  --extract getsnps.snplist.txt
  --mds-plot 4
  --memory 3500
  --mind .05
  --out theoutput
  --silent

Hostname: thehost.local
Working directory: /path/to/files
Start time: Thu Jun 22 13:23:56 2017

Random number seed: 1498163036
64380 MB RAM detected; reserving 3500 MB for main workspace.
XXXX variants loaded from .bim file.
XXXXXX people (XXXXX males, XXXXX females) loaded from .fam.
--extract: XXXX variants remaining.
136 people removed due to missing genotype data (--mind).
IDs written to theoutput.irem .
Using up to 15 threads (change this with --threads).
Before main variant filters, XXXXXX founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.995424.
XXXX variants and XXXXXX people pass filters and QC.
Note: No phenotypes present.

Error: Out of memory.  The --memory flag may be helpful.
Failed allocation size: 94347879936

End time: Thu Jun 22 13:23:56 2017


Christopher Chang

Jun 22, 2017, 6:19:42 PM
to plink2-users
That's because --mds-plot requires enough memory for an NxN matrix, where N is the number of points in the multidimensional scaling operation.  You have N>100k, so plink really needed ~100 GB to perform this operation.
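As a back-of-the-envelope check (my arithmetic, assuming the matrix holds 8-byte doubles; PLINK's internal layout may differ), the NxN requirement grows quadratically with sample count:

```python
# Rough memory sizing for a dense N x N matrix of 8-byte doubles.
# (Illustrative assumption; PLINK's internal storage may differ.)
def nxn_bytes(n, bytes_per_cell=8):
    """Bytes needed for a dense N x N matrix."""
    return n * n * bytes_per_cell

# At N = 100,000 samples this is already 80 GB, consistent with the
# ~94 GB failed allocation in the log for N a little above 100k.
print(nxn_bytes(100_000))  # 80000000000 bytes = 80 GB
```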

The standard workaround is to use "--cluster --mds-plot by-cluster --K [# of clusters to use for MDS]", instead of just --cluster --mds-plot.

Christopher Chang

Jun 22, 2017, 6:26:53 PM
to plink2-users
Actually, sorry, that wouldn't work either, since --cluster also requires enough memory for an NxN matrix.

With a 4GB-per-process limit, you may want to just use plink to perform the distance matrix computation (since, with --parallel, this computation can be split into ~3.5GB pieces); then use another program to perform clustering/scaling on that matrix.
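To estimate how many --parallel pieces that split requires, a rough sizing sketch (the helper name and the 8-bytes-per-cell assumption are mine for illustration, not PLINK internals; PLINK's actual per-piece accounting may differ):

```python
# Rough sizing for splitting a distance-matrix computation into pieces.
# The full matrix is triangular, so it holds n*(n-1)/2 cells.
import math

def parallel_pieces(n, mem_bytes, bytes_per_cell=8):
    """Hypothetical helper: number of --parallel pieces needed so each
    piece's share of the triangular distance matrix fits in mem_bytes."""
    total = n * (n - 1) // 2 * bytes_per_cell
    return math.ceil(total / mem_bytes)

# With N = 100k samples and ~3.5 GB per piece:
print(parallel_pieces(100_000, int(3.5 * 2**30)))  # 11 pieces
```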

Jiang

Jun 23, 2017, 1:43:59 AM
to plink2-users
Hi,

Thanks for the prompt reply. I've tried switching to a different server, using --memory 200000 (200 GB) and reserving up to 500 GB for the job, yet I get a very similar error message; the memory allocation issue persists. The log file is almost the same as above, apart from the following lines:

  --memory 200000
Using up to 31 threads (change this with --threads).
Failed allocation size: 755048449856


So the memory requirement now appears even larger (the failed allocation is ~755 GB). Am I missing something? If there is no better alternative, I will compute the distance matrix and then cluster/scale it outside PLINK. Could you please advise on how to do that? Perhaps R's cmdscale (SVD-based, as mentioned in the documentation)?
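For reference, classical (Torgerson) MDS, which is what R's cmdscale computes, can be run directly on a precomputed distance matrix; a minimal NumPy sketch (my implementation of the textbook algorithm, feasible only when the full matrix fits in this machine's RAM):

```python
# Classical (Torgerson) MDS from a precomputed distance matrix,
# the same algorithm as R's cmdscale. Minimal sketch using NumPy.
import numpy as np

def classical_mds(d, k=4):
    """Return the top-k MDS coordinates for symmetric distance matrix d."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    b = -0.5 * j @ (d ** 2) @ j           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]      # keep the top-k
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0))

# Tiny usage example: three collinear points at positions 0, 1, 3.
d = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
coords = classical_mds(d, k=1)
print(coords.ravel())  # 1-D coordinates (sign is arbitrary)
```

The pairwise differences of the recovered coordinates reproduce the input distances exactly here, because the distances are Euclidean.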

Best,

Christopher Chang

Jun 23, 2017, 1:50:42 AM
to plink2-users
If you're fine with computing the top 2 principal components instead of MDS coordinates, plink 2.0's "--pca approx 2" has a reasonable memory requirement even in the hundreds-of-thousands-of-samples case, and also shouldn't take very long to run.  (This is not in plink 1.9, though.)
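The memory savings come from never forming an NxN matrix; a toy NumPy sketch of the randomized-projection idea behind approximate PCA (my illustration of the general randomized-SVD technique; plink 2.0's --pca approx implementation differs in detail):

```python
# Toy randomized SVD: approximate top-k sample-space PCs of an
# N x M genotype-like matrix while holding only N x (k + oversample)
# intermediates, never an N x N matrix. Illustrative sketch only.
import numpy as np

def approx_top_pcs(x, k=2, oversample=10, n_iter=2, seed=0):
    rng = np.random.default_rng(seed)
    xc = x - x.mean(axis=0)                     # center each column
    y = xc @ rng.standard_normal((xc.shape[1], k + oversample))
    for _ in range(n_iter):                     # power iterations sharpen the subspace
        y, _ = np.linalg.qr(xc @ (xc.T @ y))
    q, _ = np.linalg.qr(y)                      # orthonormal range basis
    u, s, _ = np.linalg.svd(q.T @ xc, full_matrices=False)
    return (q @ u)[:, :k]                       # top-k left singular vectors
```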

Jiang

Jun 23, 2017, 1:12:18 PM
to plink2-users
Thanks! That certainly ran smoothly.
Best,