Generalized rule to estimate memory requirements for plink analysis


shihch...@gmail.com

Dec 16, 2021, 12:07:21 PM
to plink2-users
Hi Chris, 

Is there any generalized guideline for estimating the memory requirements of a plink analysis?

Thanks

Shicheng

Christopher Chang

Dec 16, 2021, 8:10:11 PM
to plink2-users
If you have less than 50 million variants, 8 GB is probably enough for plink2.  Beyond that point, I'd budget ~1 GB per additional 10 million variants.
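As a back-of-envelope illustration of that rule of thumb (the function and rounding here are mine, not part of plink2):

```python
# Rule of thumb from above: ~8 GiB covers up to 50 million variants,
# plus ~1 GiB for every additional 10 million variants.
def plink2_workspace_estimate_mib(n_variants: int) -> int:
    """Rough workspace size in MiB; illustrative only, not a plink2 API."""
    base_mib = 8 * 1024                      # "8 GB is probably enough"
    extra = max(0, n_variants - 50_000_000)  # variants past 50 million
    extra_gib = -(-extra // 10_000_000)      # ceiling: ~1 GiB per 10M extra
    return base_mib + extra_gib * 1024

# e.g. 75 million variants -> 8 GiB + 3 GiB = 11264 MiB
print(plink2_workspace_estimate_mib(75_000_000))
```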

Shicheng Guo

Dec 16, 2021, 8:19:14 PM
to Christopher Chang, plink2-users
Great. Thank you so much for the guidance!! Shicheng


freeseek

Jan 26, 2023, 12:25:31 PM
to plink2-users
I am having an issue using plink2 within a pipeline where memory requirements must be strictly specified, and requesting too much memory needlessly incurs additional cost. I am trying to run "plink2 --bcf ... --make-pgen ..." and I usually provide 3.5GB, but with some datasets that does not seem to be enough (the sample size is only ~7,000 samples, including dosages, and the number of variants is >270k, though I am not sure of the exact count). Does plink2 load the whole dataset into memory when performing a simple conversion from VCF? If so, is there an easy way to estimate memory cost as a function of the number of samples and the number of variants?

Christopher Chang

Jan 26, 2023, 1:38:55 PM
to plink2-users
Use the --memory flag to set the size of plink2's (or plink 1.9's) main workspace.  plink2's total memory usage will not exceed this by more than a few tens of megabytes.
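For example, a pipeline step might pin the workspace explicitly; a minimal sketch, where the file names are placeholders:

```python
import subprocess

# Pin the workspace explicitly; --memory takes a value in MiB, and total
# usage should exceed it by no more than a few tens of megabytes.
subprocess.run(
    [
        "plink2",
        "--bcf", "input.bcf",    # placeholder input path
        "--make-pgen",
        "--out", "converted",    # placeholder output prefix
        "--memory", "3000",      # leave headroom below a 3.5GB allocation
    ],
    check=True,
)
```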

Christopher Chang

Jan 26, 2023, 2:12:15 PM
to plink2-users
Oops, I just realized that may not have fully answered your questions.  No, plink2 does not load the whole dataset into memory at once when converting from VCF.  But it does keep most of the .pvar/.psam contents in memory (exception: the INFO column).  The .pvar is usually much larger than the .psam, which is why my earlier answer to Shicheng was based on only the variant count.
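One rough way to gauge that resident metadata, under the assumption (mine, not verified against plink2's source) that it roughly tracks the .pvar size minus the INFO column:

```python
# Sum .pvar field sizes while skipping the INFO column, which is not kept
# in memory. This is a back-of-envelope proxy, not plink2 internals.
def pvar_resident_bytes_estimate(pvar_path: str) -> int:
    total = 0
    info_col = None
    with open(pvar_path) as f:
        for line in f:
            if line.startswith("##"):
                continue                  # skip header metadata lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#"):
                if "INFO" in fields:
                    info_col = fields.index("INFO")
                continue                  # column-name line
            if info_col is not None:
                del fields[info_col]      # exclude the INFO column
            total += sum(len(x) for x in fields)
    return total

# 'data.pvar' is a placeholder path
print(f"~{pvar_resident_bytes_estimate('data.pvar') / 2**20:.1f} MiB")
```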

Anyway, going back to my original response, when you say you "provided 3.5GB", do you mean that you specified something like "--memory 3500" on the command line?  If not, you may find that --memory does accomplish what you want.

freeseek

Jan 27, 2023, 10:26:41 AM
to plink2-users
I meant that the job manager provided a VM with 3.5GB and I was not using the --memory flag. Now I am trying with --memory 3465 to see if it improves things.

freeseek

Feb 23, 2023, 11:16:46 AM
to plink2-users
I am still having issues. I am running PLINK2 on a VM with 3.5GB, now using the --memory 3150 flag. I am trying to convert a VCF with both genotypes and dosages, with 395,716 samples and 127,373 variants. PLINK2 does not seem to have any issues when first scanning the VCF file, with almost negligible memory requirements (<0.5GB). However, after scanning the VCF file, PLINK2 gets killed. A smaller run with 21,753 samples and the same number of variants shows that the memory requirement after scanning the VCF goes from negligible to ~2.2GB.

This memory requirement cannot be explained by the .pvar and .psam files, which are expected to be very small, so I am at a loss as to what PLINK2 requires when performing the VCF to PGEN conversion and how to estimate this memory requirement. As this is a costly step when handling imputed biobank data, I am interested in understanding it thoroughly to keep costs at a minimum. Thank you as always in advance

Christopher Chang

Feb 23, 2023, 2:14:21 PM
to plink2-users
I can post more information about the VCF conversion step's memory requirements later today, but you may have noticed that I gave a generic guideline of "8 GiB for up to 50 million variants, add another GiB for every 10m variants past that" in another similar thread.  The program tries to provide a wide range of usable functionality within an 8 GiB workspace, but there has been no serious effort to reduce memory requirements below that point.  I'd expect you to constantly run into problems if you continue trying to use a 3.5GB VM.

freeseek

Feb 23, 2023, 2:29:31 PM
to plink2-users
In retrospect, I see from testing that when converting from VCF, PLINK2 will fully use either 50% of the machine's memory or the amount provided with the --memory flag, no matter the size of the VCF. Somehow, when I ran it with --memory 3150 on a 3.5GB VM, it got killed without complaining that it could not allocate memory, so most likely PLINK2 and the base operating system together consumed more than 3.5GB. Maybe the solution here is simply to lower the --memory value and leave more space for a small PLINK2 overrun and some basic operating system needs. I am surprised, though, as 3.5GB-3150MB=434MB already seemed like a lot to me. And reducing memory needs where possible is relevant all the way up to 3.75GB IMO, as that is the smallest unit of cost on Google Cloud for the n1-standard-1 machine; asking for more than that immediately doubles the cost.
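A simple headroom calculation along those lines, with guessed reserves rather than measured requirements:

```python
# Size --memory so that workspace + OS needs + plink2's small
# out-of-workspace overrun all fit on the VM. Reserves are assumptions.
MIB_PER_GIB = 1024

vm_total_mib = int(3.75 * MIB_PER_GIB)  # n1-standard-1 advertises 3.75GB
os_reserve_mib = 900                    # guessed kernel + disk-cache reserve
plink2_overhead_mib = 64                # "a few tens of megabytes"

memory_flag = vm_total_mib - os_reserve_mib - plink2_overhead_mib
print(f"--memory {memory_flag}")        # 2876 with these numbers
```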

Christopher Chang

Feb 23, 2023, 3:20:57 PM
to plink2-users
I understand that it would provide incremental value, but the top priority for *biobank-scale* dataset analysis is to make it possible at all within a reasonable amount of time.  Making it fit in 3.x GiB instead of 8, when 8 is inexpensively available to practically everyone, is not something I can justify spending time on for a long while to come, and in particular any time/space tradeoffs will continue to be made in favor of reducing runtime when that doesn't push ordinary memory requirements above 8 GiB.

With that said, if plink2 is forcing some large *out-of-workspace* allocation here, whether in its own process space or somehow inducing it from the OS, that's a problem.  Yes, I would have expected 3.5GB - 3150MB to be enough breathing room.  I will investigate this.

freeseek

Feb 23, 2023, 4:02:32 PM
to plink2-users
Just to provide some context: when running computations in the cloud, it pays off to structure large computations as small jobs that can conveniently be run in preemptible mode, which is much cheaper. This is in contrast with running computations on large machines with many threads, a more typical scenario in HPC environments. If you split a large computation into smaller jobs, it is best to run them on single-CPU VMs, and since the RAM/CPU ratio is still around 4GiB per CPU, there is still value in being able to run computations on VMs with less than 4GiB of available RAM. Obviously the VMs will sit on a real machine with much more than 4GiB, but what is available per CPU is what matters.

Ultimately, when it comes to biobanks, the measure of what it takes to run a computation is how much it costs, rather than how much memory or disk space it needs or how long it takes to run. For my personal needs, having a cheap way to convert a VCF to PGEN makes the difference between keeping the same data stored in a single format (VCF) or in both formats (VCF and PGEN).

This was the run converting a VCF with 127,373 variants and 395,716 samples on a VM with 26GB, using --memory 23400:

[attachment: plink2_vcf2pgen.png]

According to the monitoring command, the VM needed a maximum of 23.62GiB, which is almost 800MiB more than the 23,400MiB workspace (23.62GiB = 24,187MiB; 24,187 - 23,400 = 787MiB). Some of that must be from the OS, but maybe there is indeed an excess request from PLINK2.

Christopher Chang

Feb 24, 2023, 12:52:19 AM
to plink2-users
Okay, it looks like, on Linux, I should be using the MemAvailable value in /proc/meminfo to estimate the maximum amount of memory a process can actually use.  When I tested this VCF-import job on a VM advertised as having "4 GiB" of memory, MemAvailable ranged from ~2872 to 2952 MiB (out of 3843, not 4096, total MiB) while I was using the VM, and if I set a --memory value above that, the process was killed.  So, even on a VM with less than 4 GiB of RAM, Linux wants to reserve almost 1 GiB, mostly for disk cache, it seems.

The next plink2 build will check the MemAvailable value on Linux and cap the workspace size a bit below it.
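A minimal sketch of the same check done by hand, for anyone sizing --memory before that build lands (the headroom margin here is arbitrary, not the cap the new build will use):

```python
# Read Linux's MemAvailable (reported in kB in /proc/meminfo) and size
# the plink2 workspace a bit below it.
def mem_available_mib() -> int:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # kB -> MiB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

HEADROOM_MIB = 256  # arbitrary safety margin
print(f"--memory {mem_available_mib() - HEADROOM_MIB}")
```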