duplicate id names in 1000 genomes phase 3 data

Jing Zhang

Apr 12, 2018, 1:28:40 PM
to plink2-users
Hi,

I am using the 1000 Genomes (1kg) phase 3 data to calculate LD. After removing duplicate positions and non-biallelic SNPs, I ran into another problem: duplicate rsIDs.
PLINK stops with the message "Error: Duplicate ID 'rs11952502'." I found that the same ID appears on different chromosomes, for example:
10 rs11952502 0 45889392 T C
23 rs11952502 0 75896726 G T

I checked all chromosomes and found 20+ such SNPs. Would you suggest removing all of them? Alternatively, is there a way in PLINK to rename the SNPs as chrom:pos?

I have another concern. My input is a list of SNP IDs, and for each one I want to output all SNPs within 500 kb that have r2 > 0.8 and MAF > 1%. I can do this in two ways:

1. Extract all SNPs with MAF > 1% first, then run --ld-snps on my SNP list.
2. Run --ld-snps directly on all SNPs, then keep the results with MAF > 1%.

Will the two give the same results? Could an input SNP have MAF < 1% in the 1kg panel (but > 1% in the GWAS panel), so that no LD SNPs are found for it if I filter first?

Thank you very much!

Christopher Chang

Apr 12, 2018, 1:34:52 PM
to plink2-users
1. I normally use plink 2.0's --set-all-var-ids flag to give all variants chrom/pos/ref/alt-based names.  This deals with both unnamed SNPs and the occasional duplicate rsID.
2. Since you aren't changing the set of samples, estimated MAFs shouldn't change.
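
For readers who want to see what the renaming in point 1 does, here is a minimal Python sketch that applies the same idea to .bim lines directly. It is not PLINK's implementation. The function names rename_bim_ids and duplicate_ids are hypothetical, and the sketch assumes the usual six-column .bim layout (chromosome, ID, cM, position, allele 1, allele 2). A .bim file stores alleles as A1/A2, which is not guaranteed to be ref/alt order, so the names produced here may differ from the ones --set-all-var-ids would produce.

```python
from collections import Counter


def duplicate_ids(bim_lines):
    """Return the sorted variant IDs that occur on more than one .bim line."""
    counts = Counter(line.split()[1] for line in bim_lines)
    return sorted(vid for vid, n in counts.items() if n > 1)


def rename_bim_ids(bim_lines):
    """Replace every variant ID with a chrom:pos:allele1:allele2 name.

    Assumption: allele order is taken as it appears in the .bim file,
    which is A1/A2 and not necessarily ref/alt.
    """
    renamed = []
    for line in bim_lines:
        chrom, _old_id, cm, pos, a1, a2 = line.split()
        new_id = f"{chrom}:{pos}:{a1}:{a2}"
        renamed.append("\t".join([chrom, new_id, cm, pos, a1, a2]))
    return renamed


# The two lines quoted in the question above.
bim = [
    "10 rs11952502 0 45889392 T C",
    "23 rs11952502 0 75896726 G T",
]

fixed = rename_bim_ids(bim)
print(duplicate_ids(bim))    # IDs duplicated before renaming
print(duplicate_ids(fixed))  # IDs duplicated after renaming
print(fixed[0].split("\t")[1])
```

Because the new names include the chromosome and position, the two rs11952502 records no longer collide. An SNP list keyed by rsID would then need to be mapped to the new names before it is used with --ld-snps.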