Hello,
I am working with 1000 Genomes Phase 3 data. On chr 22 in particular, many variants in the *.bim file have two or more IDs separated by semicolons (see Example 1). I am using PLINK 1.9 (1.9b_5.2).
Example 1:
22 rs587638893 0 16050568 A C
22 rs587720402 0 16050607 A G
22 rs587593704 0 16050627 T G
22 rs587670191 0 16050646 T G
22 esv3647175;esv3647176;esv3647177;esv3647178 0 16050654 <CN3> A
OR
22 rs539868657;rs561027534 0 16349650 T G
22 rs562311818;rs377092600 0 16404838 G GA
22 rs374006257;rs200929253 0 16577044 T TG
1) When filtering SNPs with --exclude and --extract, does PLINK recognize each of the rsIDs at such a position?
2) I have noticed that many variants share the same variant ID. This causes PLINK to crash when I run the --clump option. So I first run plink --list-duplicate-vars to generate a list of duplicates. However, this does not include variants that have the same variant ID at the same position (Example 2). Therefore I use bash (cut -f2 $bimfile | uniq -D > remove_these_snps.txt) to collect these SNPs, add them to the ones from --list-duplicate-vars, and filter them all out with --exclude. Does this make sense? Should this even be an issue for PLINK, or am I doing something wrong?
Thank you for your help!
Example 2:
22 rs563541510 0 18078898 AAAAT A
22 rs563541510 0 18078898 AAAATAAAT A
22 rs563541510 0 18078898 AAAATAAATAAAT A
22 rs563541510 0 18078898 AAAATAAATAAATAAAT A
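
For reference, here is a minimal sketch of the ID-based duplicate list from point 2, written so that it does not depend on the duplicated IDs being on adjacent lines or on the .bim file being tab-delimited. The function name `dup_ids` and the file names in the usage comments are placeholders I made up for illustration.

```shell
# Hypothetical helper: print each variant ID (column 2 of a .bim file)
# that occurs more than once, one line per duplicated ID.
dup_ids() {
    # awk splits on any whitespace, so tab- or space-delimited files both work;
    # sorting first means uniq -d also catches non-adjacent duplicates.
    awk '{ print $2 }' "$1" | sort | uniq -d
}

# Usage (file names are placeholders):
#   dup_ids chr22.bim > remove_these_snps.txt
#   plink --bfile chr22 --exclude remove_these_snps.txt --make-bed --out chr22_nodup
```

Note that this removes every variant carrying a duplicated ID, not just the extra copies, which is the same behaviour as the cut/uniq approach above.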