--update-name mixes chr:pos with rsID


Diandra Brkic

Feb 3, 2021, 1:43:10 PM
to plink2-users
Hi there, 

I am a plink beginner. I was trying to recode chr:pos IDs (from my .vcf file) to SNP rsIDs, and I think there is something very odd with the --out file. The steps I followed were:

1. create a list of SNPs which have a MAF > 0.1% and Imputation R2 > 0.3 from .info file
3. deal with duplicates with plink2 --rm-dup force-first list --make-bed --out name.nodup
4. --exclude duplist
5. --update-name textfile.txt

Now the output of step 5 looks very weird, with V1 containing rsIDs mixed with chr:pos IDs (picture attached).
My question is: is this normal? And should I remove the chr:pos variants before merging with the reference data?
THANK YOU

Diandra


Screenshot 2021-02-03 at 19.41.58.png
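
Since the screenshot is not reproduced here, a minimal sketch of how the mix of ID styles could be quantified from the .bim file. The toy rows below are made up for illustration, and the check assumes that anything not matching `rs` followed by digits is an ID that --update-name left unchanged.

```shell
# Toy .bim (chrom, ID, cM, bp, allele 1, allele 2); the rows are invented.
bim=$(mktemp)
printf '%s\n' \
  '1 rs123 0 10177 C A' \
  '1 1:10352 0 10352 TA T' \
  '2 rs456 0 20500 G A' \
  '2 2:30100 0 30100 T C' > "$bim"

# Classify each variant ID as rsID or other and count both groups.
summary=$(awk '{ if ($2 ~ /^rs[0-9]+$/) rs++; else other++ }
               END { printf "rsID=%d other=%d\n", rs, other }' "$bim")
echo "$summary"

# List the IDs that were not renamed to an rsID.
unmapped=$(awk '$2 !~ /^rs[0-9]+$/ { print $2 }' "$bim")
echo "$unmapped"
rm -f "$bim"
```

On a real dataset, `$bim` would point at the .bim written by the --update-name step instead of a temporary file.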

Christopher Chang

Feb 3, 2021, 1:54:40 PM
to plink2-users
A more common approach is to use --set-all-var-ids to make *all* variants have position/allele based IDs, and then use --recover-var-ids whenever you want to return to rsIDs.

Diandra Brkic

Feb 3, 2021, 3:00:15 PM
to plink2-users
Ok cool, thank you. I know this is more common, but to be able to merge with the 1000 Genomes reference sample, and subsequently perform my --assoc analysis with a subset of SNPs that I am interested in, I need my data to be in SNP rsID format and not in chr:pos... if that makes sense?

Just to make sure I got this right: this means in step 3 add --set-all-var-ids. so it would be:
plink2 --rm-dup force-first list --set-all-var-ids --make-bed --out name.nodup ; right? 

thanks again for your help

Christopher Chang

Feb 3, 2021, 3:16:39 PM
to plink2-users
* You need to specify a template string for --set-all-var-ids; please read the documentation.  Also read the documentation for --recover-var-ids which I linked to in my previous response, since that lets you get the rsIDs back whenever you want.
* --assoc has been obsolete for more than a decade, since it doesn't let you use e.g. principal-component covariates to correct for large-scale population structure.  You almost certainly want to use --glm (or --linear/--logistic in plink 1.9) instead.
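
For illustration, a commonly used template string is `@:#:$r:$a`, where plink2 substitutes the chromosome code for `@`, the base-pair position for `#`, the reference allele for `$r` and the alternate allele for `$a`. The sketch below shows the plink2 call as a comment (file prefixes are placeholders) and then emulates the template with awk on two invented .bim rows. The awk step is only a way to preview the ID format, not a replacement for plink2, and it assumes the .bim lists ALT in column 5 and REF in column 6.

```shell
# The plink2 call would look roughly like this; single quotes stop the
# shell from expanding $r and $a. Prefixes are placeholders.
#   plink2 --bfile name.nodup --set-all-var-ids '@:#:$r:$a' --make-bed --out name.ids

# Emulate the template on a toy .bim to preview the resulting IDs.
# Assumed column order: chrom, ID, cM, bp, ALT, REF.
new_ids=$(printf '%s\n' \
  '1 rs123 0 10177 C A' \
  '1 1:10352 0 10352 TA T' |
  awk '{ print $1 ":" $4 ":" $6 ":" $5 }')
echo "$new_ids"
```

Both the rsID row and the chr:pos row end up in the same chr:pos:ref:alt style, which is the point of renaming *all* variants.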

Diandra Brkic

Feb 3, 2021, 3:40:11 PM
to plink2-users
As you can tell I am a total beginner, I missed the string part. Thank you so much :)
* I am not sure I get the back/forth part (--set-all-var-ids / --recover-var-ids); all I would like to do is recode the chr:pos IDs into their respective rsIDs, even if that means taking the 'stringent' route and getting rid of duplicates. I know this is not ideal, but since my ultimate goal is to just focus on a subset of specific SNPs I thought this was the simpler way? Also, merging with a ref population which has only rsIDs would be easier. Or am I completely off the road here?
* I am aware of the --assoc limitations, but this first step was for me to try to understand the QC and how to add continuous phenotypes to the data. I am hoping to get to the --logistic part but will take some time.


Christopher Chang

Feb 3, 2021, 5:07:42 PM
to plink2-users
Ok, so the main premise of this discussion is that you don't have rsIDs for some of your variants, and your --update-name file doesn't help.  Given that, you're probably best off using chr:pos:alleles for everything and not looking back.  Ignore my comment about --recover-var-ids, and instead convert your 1000 Genomes dataset using the same --set-all-var-ids setting before merging.  This is substantially easier than trying to replicate the chr:pos:alleles -> rsID mapping used in the 1000 Genomes dataset, because that can vary based on dbSNP build and filtering criteria.
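
A rough sketch of that workflow, under stated assumptions: the plink2 calls are shown as comments with hypothetical prefixes (`mydata`, `ref1kg`), and the ID lists are invented stand-ins for column 2 of the two renamed .bim files. The runnable part only checks how many IDs the two datasets share before attempting a merge.

```shell
export LC_ALL=C  # make sort and comm agree on ordering

# Same template for both datasets (prefixes are hypothetical):
#   plink2 --bfile mydata --set-all-var-ids '@:#:$r:$a' --make-bed --out mydata.ids
#   plink2 --bfile ref1kg --set-all-var-ids '@:#:$r:$a' --make-bed --out ref1kg.ids

# Toy ID lists standing in for column 2 of the two resulting .bim files.
mine=$(mktemp); ref=$(mktemp)
printf '%s\n' '1:10177:A:C' '1:10352:T:TA' '2:20500:A:G' | sort > "$mine"
printf '%s\n' '1:10177:A:C' '2:20500:A:G' '2:30100:C:T' | sort > "$ref"

# IDs present in both datasets, i.e. the variants a merge could line up.
shared=$(comm -12 "$mine" "$ref")
echo "$shared"
rm -f "$mine" "$ref"
```

IDs built this way only match when both datasets use the same genome build and agree on which allele is REF and which is ALT, so a small overlap is worth investigating before merging.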

Meanwhile, --linear/--logistic is not any harder to get started with than --assoc.

Diandra Brkic

Feb 3, 2021, 6:31:42 PM
to plink2-users
Perfect, that sounds reasonable. Thank you so much, you just spared me a massive headache.

Will definitely try --linear/--logistic, then.
