Which of PLINK's functions are multithreaded and which are not

Scott Wood

Aug 27, 2017, 9:28:19 PM
to plink2-users
Hi folks,

Some researchers at our institute are trying to take advantage of the multithreading features that PLINK offers. The documentation mentions that "multithreaded PLINK functions run...", and I am curious whether it is documented which functions are multithread-aware and which are not.

For example when running "plink2 --freq --threads 4..." we see only a single thread used.

Does the answer to this question vary from version to version? We have multiple versions available to our users and would like to help guide them on best use.

Thanks in advance.

Cheers
Scott

Christopher Chang

Aug 27, 2017, 10:25:13 PM
to plink2-users
There's a small amount of variation between builds: it's easier to write correct single-threaded code than multithreaded code, so sometimes I'll start with a single-threaded implementation and multithread it later.

But in general, v1.9 has multithreading for several of the most computationally expensive operations, but not basic mostly-I/O-bound stuff like --freq and --make-bed.  v2.0 has multithreading almost everywhere, including --freq and --make-bed.  Both versions may reduce the number of worker threads used for a particular operation if e.g. the job appears to be too small to usefully occupy all the threads, or memory is a bit tight.

Scott Wood

Aug 27, 2017, 11:15:47 PM
to plink2-users
Thanks for the quick reply.  The user who was using "--freq" and only seeing one thread being used had the following command line:

plink2  --threads 2 --memory 2500mb --bgen their.bgen --sample their.file.sample  --out their.folder/their.bgen.plink2.output --keep their.pheno   --maf 0.001 --freq

Would you have any guidance on what a reasonable memory request would be so that it isn't too tight?  From your response, I'd expect that to be the cause of their minimal thread use (or would you need more metrics, such as the size of their bgen file?).

As we have many users who use many of the features of PLINK, is there a way to get a table of which ones do and which don't use multiple threads?  We'd like them to get the most out of the tool without impacting the resources for other users.

For example, when a user requests 4 cores from our HPC but only uses one, three cores sit unused and other jobs wait for those unused resources to free up before the scheduler runs them.  When someone submits enough jobs to fill our entire cluster with these jobs, we're running at 25% capacity and other users' jobs sit in line, waiting for their turn.

If we could help educate our PLINK users about which functions WILL gain an advantage by using multiple threads, it'll be a huge help to them, and to all other users of the shared resources.

Cheers

Christopher Chang

Aug 27, 2017, 11:45:06 PM
to plink2-users
* If it's v1.9, all of these operations are single-threaded.  (The main multithreaded commands in plink 1.9 are --make-rel/--make-grm-gz/--make-grm-bin, --distance/--genome/--cluster, and --pca.)
* If it's a recent v2.0 build (July 17 or newer), --bgen import and allele frequency computation should be multithreaded.  (Note that --bgen import is actually the more expensive operation, by a huge margin.)
* The memory requirement depends on both the dimensions of the dataset and what you're trying to do with it.  Most of the time, ~80 bytes per variant will suffice (so 2500mb lets you work with up to ~30 million variants), but if you have tens or hundreds of thousands of samples and need to compute sample x sample matrices, you may need far more memory.
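
For anyone wanting to apply the points above when sizing a job, here is a minimal Python sketch. It only encodes what the reply states: the list of main multithreaded commands in PLINK 1.9 and the rough ~80 bytes per variant figure. The function names are hypothetical, and treating "mb" as 1,000,000 bytes is an assumption made for the arithmetic; neither is part of PLINK. The estimate does not cover the sample x sample matrix case, which may need far more memory.

```python
# Rule of thumb from the reply above: ~80 bytes per variant usually suffices.
# It does NOT cover sample x sample matrix computations on large cohorts.
BYTES_PER_VARIANT = 80

# Commands named in the reply as the main multithreaded ones in PLINK 1.9.
PLINK19_MULTITHREADED = {
    "--make-rel", "--make-grm-gz", "--make-grm-bin",
    "--distance", "--genome", "--cluster", "--pca",
}


def max_variants(memory_mb, bytes_per_variant=BYTES_PER_VARIANT):
    """Rough upper bound on the variant count for a given memory request.

    Assumption: 1 MB is treated as 1,000,000 bytes for this estimate.
    """
    return (memory_mb * 1_000_000) // bytes_per_variant


def plink19_uses_threads(flags):
    """Return True if any flag is one of the PLINK 1.9 commands listed above."""
    return any(flag in PLINK19_MULTITHREADED for flag in flags)


if __name__ == "__main__":
    # 2500 MB at ~80 bytes per variant: roughly 30 million variants.
    print(max_variants(2500))
    # --freq is not in the 1.9 list, so extra threads would sit idle there.
    print(plink19_uses_threads(["--freq", "--maf"]))
    print(plink19_uses_threads(["--pca"]))
```

With a 2500 MB request this gives about 31 million variants, in line with the "~30 million" figure above. For v2.0 builds no such lookup is needed, since multithreading is available almost everywhere, though PLINK may still reduce the worker thread count for small jobs or tight memory.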

Scott Wood

Aug 28, 2017, 2:04:32 AM
to plink2-users
Thanks again for your quick responses.  I've directed our users to this thread so none of your advice is lost in translation, and I've installed a newer version of PLINK2.  The one they were using was from June 30th (so not July 17 or newer).

With the new binary, your tips, and a bit of testing, they should be good to go.

Cheers
Scott