Here is a question about how to remove duplicate IIDs from a large .ped file with R.
The data concern approximately 2,330 subjects genotyped for about 700,000 SNPs with an Illumina array. The whole .ped file is about 7 GB.
We have an approach that works when the number of SNPs in the .ped file is small, for example when I only look at a subset of SNPs, because R can then open the .ped file with read.table(). In that case removing duplicates works well: 36 duplicate IIDs are removed and the .ped file is saved afterwards.
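Roughly, the small-subset approach looks like the sketch below. This is an illustration, not our exact script: the file names are placeholders, a three-row example file is written first so the code runs on its own, and keeping the first row for each IID is an assumption.

```r
# Placeholder file names; a tiny example .ped is created so the sketch is self-contained.
ped_in  <- tempfile(fileext = ".ped")
ped_out <- tempfile(fileext = ".ped")
writeLines(c(
  "FAM1 ID1 0 0 1 -9 A A G G",
  "FAM2 ID2 0 0 2 -9 A C G T",
  "FAM3 ID1 0 0 1 -9 A A G G"
), ped_in)

# Read every column as character so IDs and allele letters (e.g. "T") are not converted.
ped <- read.table(ped_in, header = FALSE, colClasses = "character")

# Column 2 of a .ped file holds the individual ID (IID).
# duplicated() flags the second and later occurrences, so the first row per IID is kept.
dup <- duplicated(ped[[2]])
ped_unique <- ped[!dup, ]

# Save the de-duplicated table in the same whitespace-separated layout.
write.table(ped_unique, ped_out, quote = FALSE, row.names = FALSE, col.names = FALSE)
cat(sum(dup), "duplicate IID rows removed\n")
```

This loads the whole table into memory at once, which is fine for a subset of SNPs.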
However, removing duplicates from the whole .ped file doesn't work: R keeps processing until I hit stop. I also tried opening the .ped file with read.pedfile from the trio library, with the same result.
Could anybody advise me on how to remove the duplicates properly, either with or without R? Is this a computer/memory problem, or maybe something in my approach?
Thank you!
PS: the reason I suspect duplicates is the following error from read.plink:
geno <- read.plink(gwas.fn$bed, gwas.fn$bim, gwas.fn$fam, na.string = ("-9"))
Error in read.plink(gwas.fn$bed, gwas.fn$bim, gwas.fn$fam, na.string = ("-9")) :
couldn't create unique subject identifiers
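
One way to check this suspicion directly would be to look for repeated IDs in the .fam file. The following is a minimal sketch, assuming the standard six-column .fam layout; the example file and the column names are made up for illustration.

```r
# Placeholder .fam file with one repeated IID, written here so the sketch is self-contained.
fam_file <- tempfile(fileext = ".fam")
writeLines(c(
  "FAM1 ID1 0 0 1 -9",
  "FAM2 ID2 0 0 2 -9",
  "FAM3 ID1 0 0 1 -9"
), fam_file)

# Column names are illustrative labels for the six .fam columns.
fam <- read.table(fam_file, header = FALSE, colClasses = "character",
                  col.names = c("FID", "IID", "PAT", "MAT", "SEX", "PHENO"))

# IIDs that occur more than once, regardless of family ID.
dup_iid <- unique(fam$IID[duplicated(fam$IID)])

# Number of rows whose FID + IID combination repeats an earlier row.
dup_pair <- sum(duplicated(fam[, c("FID", "IID")]))

print(dup_iid)
print(dup_pair)
```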