Remove duplicate IDs from a large .ped file

David

Dec 10, 2021, 10:31:46 AM

to plink2-users

Here is a question about how to remove duplicate IIDs from a large .ped file with R.
The data concern approximately 2330 subjects genotyped for about 700,000 SNPs with an Illumina array. The whole .ped file is about 7 GB.

We have an approach that works when the number of SNPs in the .ped file is small (for example, when I only want to look at a subset of SNPs), because R can then open the .ped file with read.table(). In that case removal of duplicates works well: 36 duplicate IIDs are removed and the .ped file is then saved.

However, removing duplicates from the whole .ped file doesn't work: R keeps processing until I hit stop. I also tried opening the .ped file with read.pedfile from the trio library, with the same result.

Could anybody advise me on how to remove duplicates in a proper way, either with or without R? Is this a computer/memory problem, or is something wrong with my approach?

Thank you!

PS: my reason to suspect duplicates is an error on:
geno <- read.plink(gwas.fn$bed, gwas.fn$bim, gwas.fn$fam, na.string = ("-9"))

Error in read.plink(gwas.fn$bed, gwas.fn$bim, gwas.fn$fam, na.string = ("-9")) :
couldn't create unique subject identifiers

Christopher Chang

Dec 10, 2021, 12:15:38 PM

to plink2-users

The .ped file format has been obsolete for close to a decade. It isn't even usable with any existing plink 2.0 build. You should almost certainly convert to plink binary format (.bed+.bim+.fam) or something similar, and never look back.
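
If the .ped file does have to be cleaned first, the stall described in the question is probably a memory issue: read.table() tries to parse every genotype column of every row at once. A streaming pass avoids that by reading one sample line at a time and looking only at the ID columns. Below is a minimal sketch in Python rather than R, not a tested pipeline. It assumes a whitespace-delimited .ped file with the IID in the second column, and it assumes that keeping the first occurrence of each IID is the desired behaviour. The function names and file names are hypothetical.

```python
from typing import Iterable, Iterator, List


def drop_duplicate_iids(lines: Iterable[str], dropped: List[str]) -> Iterator[str]:
    """Yield .ped lines whose IID (column 2) has not been seen before.

    IIDs of skipped lines are appended to `dropped` for later inspection.
    """
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # Split off only FID and IID; the genotype columns stay unparsed.
        fields = line.split(None, 2)
        if len(fields) < 2:
            raise ValueError("line has fewer than two columns: %r" % line[:50])
        iid = fields[1]
        if iid in seen:
            dropped.append(iid)  # assumption: keep first occurrence only
            continue
        seen.add(iid)
        yield line


def dedupe_ped_file(src: str, dst: str) -> List[str]:
    """Copy src to dst without repeated IIDs; return the dropped IIDs."""
    dropped: List[str] = []
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(drop_duplicate_iids(fin, dropped))
    return dropped
```

A call such as `dedupe_ped_file("gwas.ped", "gwas.dedup.ped")` (hypothetical file names) would write the filtered file and return the list of dropped IIDs. The accompanying .map file needs no change, since only samples are removed, not SNPs. Whether the repeated IIDs are true duplicates or distinct samples that share a label is worth checking before discarding anything.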
