Just to provide some context: when running computations in the cloud, it pays off to structure large computations as small jobs that can be run on preemptible instances, which are much cheaper. This is in contrast with running computations on large machines with many threads, which is the more typical scenario in HPC environments. If you split a large computation into smaller jobs, it is best to run them on single-CPU VMs, and since the typical RAM/CPU ratio is still around 4 GiB per CPU, there is real value in being able to run a computation in a VM with less than 4 GiB of available RAM. Obviously the VMs sit on a physical machine with much more than 4 GiB, but what is available per CPU is what matters. Ultimately, when it comes to biobanks, the measure of what it takes to run a computation is how much it costs, rather than how much memory or disk it needs or how long it takes. For my personal needs, having a cheap way to convert a VCF to PGEN makes the difference between keeping the data stored in a single format (VCF) and keeping it in both formats (VCF and PGEN).
This was a run converting a VCF with 127,373 variants and 395,716 samples on a VM with 26 GB of RAM, using --memory 23400.
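The invocation was roughly of this shape (a sketch: the file names are placeholders and the exact input options may have differed):

```bash
# VCF -> PGEN conversion with the workspace capped at 23,400 MiB.
# "in.vcf.gz" and "out" are placeholder names for this example.
plink2 \
  --vcf in.vcf.gz \
  --make-pgen \
  --memory 23400 \
  --out out
```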
According to the monitoring command, the VM needed a maximum of 23.62 GiB, which is almost 800 MiB more than the 23,400 MiB passed to --memory (23.62 GiB is about 24,187 MiB, and 24,187 - 23,400 ≈ 787 MiB). Some of that overhead must come from the OS, but maybe PLINK2 does indeed request a bit more than the --memory limit.
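For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a peak-memory monitor, assuming a standard Linux environment where free is available (not necessarily the monitoring approach behind the figure above):

```bash
# Sample "used" system memory (in MiB) once per second and keep the
# running maximum; run this alongside the plink2 job and stop it with
# Ctrl-C once the conversion finishes.
peak=0
while sleep 1; do
  used=$(free -m | awk '/^Mem:/ {print $3}')
  if [ "$used" -gt "$peak" ]; then
    peak=$used
  fi
  printf 'used: %s MiB  peak: %s MiB\n' "$used" "$peak"
done
```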