plink 1.9 sometimes gets stuck when scanning a .ped file


Kevin

Mar 4, 2022, 6:58:45 AM
to plink2-users
Hi, when I use PLINK 1.9 to convert the .ped format to the .bed format, it runs fine most of the time, but sometimes it stops at a stage, e.g. "Scanning .ped file... 12%", and never moves on. My .ped file is 140 GB, and the allocated memory is about 70 GB or 100 GB. But memory should not be the issue, as I have tested that PLINK can run successfully with as little as 10 GB.
I am just wondering why this happens, and whether there is a way to make it run reliably.
Thanks

Christopher Chang

Mar 4, 2022, 10:39:10 AM
to plink2-users
1. If you provide detailed instructions for me to replicate what you're seeing on e.g. an Amazon EC2 instance, and include a full .log file of what you're seeing on your end, I will investigate.
2. However, for all practical purposes the bug is in your workflow, not in plink.  The .ped file format was already grossly inefficient relative to .bed in 2007, and it was clearly obsolete by 2014 (by that point, plink's support for even uncompressed VCF import/export was better).  No plink2 build in the last 5 years can even read or write .ped; this is intentional.  Any script that generates a 10GB+ .ped file should almost certainly be generating another format instead.
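For instance, if the upstream script can emit an uncompressed VCF instead of .ped, a direct plink2 import skips the inefficient format entirely. A minimal sketch, assuming a placeholder input file named genotypes.vcf:

```shell
# Hypothetical filenames; plink2 reads VCF directly and writes its
# native binary fileset, so no .ped intermediate is ever created.
plink2 --vcf genotypes.vcf --make-pgen --out genotypes

# Or, if downstream tools still need the plink 1.x binary fileset:
plink2 --vcf genotypes.vcf --make-bed --out genotypes
```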

Kevin

Mar 4, 2022, 11:33:22 AM
to plink2-users
Thanks for your prompt response. 
1. The log file is not very informative, since essentially no fault occurs or is reported. The run just sits there and never finishes.
"PLINK v1.90b6.24 64-bit (6 Jun 2021)
Options in effect:
  --chr 1-29
  --chr-set 29
  --make-bed
  --map E:\Inbreeding\inputfiles\test_batch\80K_plink.map
  --no-fid
  --no-parents
  --no-pheno
  --no-sex
  --out wholeMonthly
  --ped E:\Inbreeding\inputfiles\test_batch\monthlyrun.ped
Hostname: GIN03
Working directory: E:\Inbreeding\inputfiles\test_batch
Start time: Wed Feb 09 01:33:21 2022
Random number seed: 1644370401
786336 MB RAM detected; reserving 393168 MB for main workspace.
Allocated 124400 MB successfully, after larger attempt(s) failed.
Scanning .ped file...12%"

2. The reason I use the .ped format is that .bed is not human-readable (although it is fast, which is why I convert to it for the downstream analysis), while .ped is easier to generate from the original format in which our genotype files arrive.
For now I have added the --threshold and --memory options to the command, and it seems to help, but I am not sure it will work in every run.
I am not expecting a detailed investigation; I am just trying my luck to see whether there is any obvious cause.
Best Regards,

Christopher Chang

Mar 4, 2022, 11:49:59 AM
to plink2-users
As mentioned in my previous response, even uncompressed VCF has been a much better choice than .ped for the last 7 years.  I stand by my claim that this is, for all practical purposes, a bug in your workflow that you should fix as soon as possible.

Christopher Chang

Mar 14, 2022, 1:39:15 PM
to plink2-users
plink2 now has a --pedmap flag that's more efficient than plink 1.9 at importing .ped+.map filesets.
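Assuming the new flag follows plink2's usual import conventions (a fileset prefix argument plus an explicit output command; the exact syntax is an assumption here), the conversion from the original post might look like:

```shell
# Hypothetical invocation: 'monthlyrun' is assumed to be the shared
# prefix of monthlyrun.ped + monthlyrun.map, mirroring the log above.
plink2 --pedmap monthlyrun --chr-set 29 --make-bed --out wholeMonthly
```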

You are still very strongly encouraged to use a different intermediate format.  In particular, --pedmap is unusual among plink2's import commands in that it does not convert directly to plink2's native format; instead it usually spends >90% of its time converting to a temporary individual-major .bed fileset that is ~1/15 the size, and then calls the existing routine to convert that to plink2's native format.  The correspondence between .ped and individual-major .bed is straightforward; if you are able to generate a 140 GB .ped file, you are almost certainly able to directly generate the individual-major .bed fileset with only a little bit of extra work.
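As a concrete illustration of that correspondence, here is a minimal Python sketch (not plink's actual code) of the individual-major .bed layout: three header bytes (0x6c, 0x1b, then 0x00 for individual-major mode), followed by each sample's calls as 2-bit codes packed four to a byte, low-order bits first (00 = hom A1, 01 = missing, 10 = het, 11 = hom A2), padded to a byte boundary per sample.

```python
def pack_sample(calls):
    """Pack one sample's 2-bit genotype codes (values 0-3, one per
    variant) into bytes, four calls per byte, low-order bits first."""
    out = bytearray()
    for i in range(0, len(calls), 4):
        b = 0
        for j, code in enumerate(calls[i:i + 4]):
            b |= code << (2 * j)  # first call occupies the lowest 2 bits
        out.append(b)
    return bytes(out)

def write_individual_major_bed(path, samples):
    """samples: list of per-sample call lists, all the same length."""
    with open(path, "wb") as f:
        f.write(bytes([0x6C, 0x1B, 0x00]))  # magic + individual-major mode
        for calls in samples:
            f.write(pack_sample(calls))

# Two samples, three variants each:
# sample 1: hom A1, het, hom A2; sample 2: het, missing, hom A1
write_individual_major_bed("toy.bed", [[0, 2, 3], [2, 1, 0]])
```

A generator script that already walks the source data sample by sample can emit these bytes directly instead of writing .ped text lines.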

Mila Sánchez Mayor

Mar 16, 2022, 5:46:49 AM
to Christopher Chang, plink2-users
Dear plink users,

I want to convert my data from .bed to .ped. I tried, but I got the error message you can see below. Splitting the data into pieces takes too much space on the hard disk. Is there any other advice?

Thanks in advance.


Mila 


./plink19 --bfile QC2snpNoRelated1 --threads 500 --recode --tab --out QC2snpNoRelated2

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to QC2snpNoRelated2.log.
Options in effect:
  --bfile QC2snpNoRelated1
  --out QC2snpNoRelated2
  --recode
  --tab
  --threads 500

Note: --tab flag deprecated.  Use "--recode tab ...".
128435 MB RAM detected; reserving 64217 MB for main workspace.
9820496 variants loaded from .bim file.
276470 people (129120 males, 147350 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 276470 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.99713.
9820496 variants and 276470 people pass filters and QC.
Note: No phenotypes present.
Error: --recode does not yet support multipass recoding of very large files;
contact the PLINK developers if you need this.
For now, you can try using a machine with more memory, and/or split the file
into smaller pieces and recode them separately.

Christopher Chang

Mar 16, 2022, 11:53:35 AM
to plink2-users
The 16 Mar 2022 development build includes an updated error message.