A score file with chr & positions instead of SNP IDs


Roni Haas

Jan 31, 2023, 11:23:24 AM
to plink2-users
Hello,
I am calculating PRSs using the command:

plink2 \
    --bfile plink_file \
    --score score_file 1 2 3 header cols=+scoresums,-scoreavgs \
    --out PRS_sum

Currently, the columns in my score_file are: SNP ID, effect allele, beta weight.
Is it possible to give chromosomes and positions (using an extra column) instead of SNP IDs in the first column?
For example, a score file with 4 columns: chr, position, effect allele, beta weight, or a similar structure.

If so, which flags should be used?

Thank you!
Roni

Christopher Chang

Feb 1, 2023, 11:19:39 AM
to plink2-users
This isn't directly supported, but it should be straightforward to use e.g. awk to generate a file with the columns in the supported order.

Christopher Chang

Feb 1, 2023, 11:24:35 AM
to plink2-users
Actually, sorry, your existing file should be supported; I had briefly forgotten that the specified column numbers aren't required to be ascending. You just need to replace the "1 2 3" part of your command. As noted in the documentation:
  • The input file must have exactly one line per scored allele. Variant IDs are read from column #i and allele codes are read from column j, where i defaults to 1 and j defaults to i+1.
  • By default, a single column of coefficients is read from column #k, where k defaults to j+1. To specify multiple columns, use --score-col-nums.

Christopher Chang

Feb 1, 2023, 11:32:08 AM
to plink2-users
...er, that wasn't responsive to your question either.  Maybe I should wait until after coffee to write these... anyway, --score does require variant IDs, but you have the option of using --set-all-var-ids to change the variant IDs in your plink2 fileset to be based on chr/position.
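
A minimal sketch of that route, assuming a whitespace-delimited score file with one header line and the four columns chr, position, effect allele, beta. The function name and file names are placeholders, not part of plink2. The awk step only builds a "chr:pos" ID column; it assumes the chromosome codes in the score file are written the same way as in the fileset, and that no two variants share a position.

```shell
# Hypothetical helper: convert a 4-column score file (chr, pos, effect allele, beta)
# into the 3-column layout read by --score, using "chr:pos" as the variant ID.
score_to_chrpos_ids() {
    awk 'BEGIN { OFS = "\t" }
         NR == 1 { print "ID", $3, $4; next }   # rewrite the header line
         { print $1 ":" $2, $3, $4 }' "$1"
}

# Possible usage (file names are placeholders):
#   score_to_chrpos_ids score_4col.txt > score_chrpos.txt
#   plink2 --bfile plink_file --set-all-var-ids @:# --make-bed --out plink_chrpos
#   plink2 --bfile plink_chrpos \
#          --score score_chrpos.txt 1 2 3 header cols=+scoresums,-scoreavgs \
#          --out PRS_sum
```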

Roni Haas

Feb 1, 2023, 12:36:55 PM
to plink2-users
Thank you so much for the help! :)

I have tried --set-all-var-ids in two ways (see below), but both are still problematic, given that I don't want to skip any affecting variants.
1) For option 1, I used the flag --set-all-var-ids @:#:\$r:\$a to create IDs of the form chr:pos:major:minor, and I formatted the score file IDs the same way.
The problem is that the order of the major and minor alleles is sometimes swapped in the score file, so many variants are not matched and are skipped.
2) For option 2, I tried --set-all-var-ids @:#: to have only chr:pos instead of SNP IDs. That option isn't perfect either, since I have multiallelic sites in the plink file (some of them are the actual affecting alleles that I need) and I get an error.

I was therefore hoping to find a way to rely on chromosomes and positions (and have the software verify the allele order) instead of on a single SNP ID field. Otherwise I am losing data, which I can't afford for this project.

Any helpful suggestions that I haven't thought of?

Thank you!
Roni

Christopher Chang

Feb 1, 2023, 12:55:17 PM
to plink2-users
You can use $1/$2 instead of $r/$a with --set-all-var-ids to specify alphabetical allele order.
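
For the score-file side, here is a rough sketch of building IDs in the same alphabetical form. It assumes a score file with one header line and five columns (chr, position, effect allele, other allele, beta), and that the alphabetical order used for $1/$2 is plain byte-wise (ASCII) ordering. The function name and column layout are illustrative, not something plink2 defines.

```shell
# Hypothetical helper: build "chr:pos:allele1:allele2" IDs with the two alleles
# in byte-wise sorted order, so they line up with --set-all-var-ids @:#:\$1:\$2.
score_to_sorted_ids() {
    LC_ALL=C awk 'BEGIN { OFS = "\t" }
         NR == 1 { print "ID", $3, $5; next }   # keep effect-allele and beta headers
         {
             a = $3; b = $4
             # Appending "" forces a string comparison rather than a numeric one.
             if ((b "") < (a "")) { t = a; a = b; b = t }
             print $1 ":" $2 ":" a ":" b, $3, $5
         }' "$1"
}

# Possible usage (file names are placeholders):
#   score_to_sorted_ids score_5col.txt > score_sorted_ids.txt
#   plink2 --bfile plink_file --set-all-var-ids @:#:\$1:\$2 --make-bed --out plink_sorted
```

The effect allele column is passed through unchanged, so --score still sees which allele the weight applies to; only the ID is normalized.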

Sinuous Walkers

May 30, 2023, 2:29:27 PM
to plink2-users
Is that solution ambiguous for multiallelic indels?

E.g., chr21:5030240:A:AC and chr21:5030240:AC:A both exist; one an insertion and the other a deletion. But they will become identical after lexical sorting.

Christopher Chang

May 30, 2023, 2:32:59 PM
to plink2-users
Yes, that solution is only appropriate for major/minor-coded SNPs in plink 1.x files.  If indels are coded major/minor, you probably need to backtrack to the last point in your workflow where they weren't, and then use plink 2.0 instead of 1.x whenever possible.
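
One way to check whether a dataset is affected is to look for distinct chr:pos:ref:alt IDs that collapse to the same string once the two alleles are sorted. The sketch below assumes a file with one such ID per line; the function name is hypothetical and the sorting is byte-wise.

```shell
# Hypothetical check: list the sorted-allele IDs that more than one distinct
# original ID maps to (for example an insertion and a deletion at the same site).
find_sorted_id_collisions() {
    LC_ALL=C awk -F: '
        {
            a = $3; b = $4
            if ((b "") < (a "")) { t = a; a = b; b = t }
            key = $1 ":" $2 ":" a ":" b
            # Count each distinct original ID only once.
            if (!($0 in seen_orig)) { seen_orig[$0] = 1; count[key]++ }
        }
        END { for (k in count) if (count[k] > 1) print k }' "$1" | LC_ALL=C sort
}

# Possible usage (ids.txt is a placeholder holding one chr:pos:ref:alt ID per line):
#   find_sorted_id_collisions ids.txt
```

An empty result would suggest that sorting does not merge any IDs in that file.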

Noe Reyna

Jul 24, 2023, 1:31:13 AM
to plink2-users

Hello, 


We're alphabetizing SNP IDs based on variants from a VCF file using plink2.

An example command is:

# create plink format files

plink2 \
    --set-all-var-ids @:#:\$1:\$2 \
    --vcf samples.vcf.gz \
    --out samples_alphabetical \
    --make-bed \
    --new-id-max-allele-len 70 missing


 

We're encountering an issue where plink creates SNP IDs based on only one ALT allele in multiallelic sites.

For example, given a multiallelic site in a vcf file, with two ALT alleles:

1     1234  .     TA    TAA,T

The resulting .pvar file has a SNP ID based on only one of the alternative alleles (here it keeps the TAA alternative allele and ignores the T alternative allele):

1     1234  1:1234:TA:TAA     TA    TAA,T

However, we would like to have in the .pvar file a SNP ID relying on the other ALT allele:

1:1234:T:TA

Is there a way to make plink generate two lines in the .pvar file in case of multiallelic sites, one for each ALT allele?

In this case, the .pvar file would have 2 lines for this position:

1     1234  1:1234:TA:TAA  …

1     1234  1:1234:T:TA  …

Alternatively, can we control somehow which ALT allele is taken for the creation of the SNP IDs?  


Thank you!

Noe

Christopher Chang

Jul 24, 2023, 10:20:55 AM
to plink2-users
If you want to "split" multiallelic variants, you can use either "bcftools norm --multiallelics -", or the not-officially-documented "--make-pgen multiallelics=-" (not officially documented because the corresponding "join" function hasn't been implemented in plink2 yet).

As for controlling which ALT allele is used in the SNP ID, plink2 doesn't currently provide options beyond $r/$a/$1/$2.  However, it's usually straightforward to write a script of your own to change the IDs in the way you want, when --set-all-var-ids is not sufficient.
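
As an illustration of such a script, the sketch below rewrites the ID column of a variant table using a chosen ALT allele. It assumes a plain-text, tab-delimited .pvar-style file whose first five columns are CHROM, POS, ID, REF, ALT, with header lines starting with "#". The function name and the fallback to the first ALT are assumptions made for the example. It only changes the ID text and does not split variants or touch genotype data.

```shell
# Hypothetical helper: set ID to "chr:pos:allele1:allele2" from REF and the n-th
# ALT allele (1-based), with the two alleles in byte-wise sorted order.
pvar_ids_from_alt() {
    # $1 = plain-text .pvar-style file, $2 = index of the ALT allele to use
    LC_ALL=C awk -v n="$2" '
        BEGIN { FS = OFS = "\t"; n = (n + 0 >= 1) ? n + 0 : 1 }
        /^#/ { print; next }                     # pass header lines through unchanged
        {
            k = split($5, alts, ",")
            alt = (n <= k) ? alts[n] : alts[1]   # fall back to the first ALT
            a = $4; b = alt
            if ((b "") < (a "")) { t = a; a = b; b = t }
            $3 = $1 ":" $2 ":" a ":" b
            print
        }' "$1"
}

# Possible usage (file names are placeholders):
#   pvar_ids_from_alt samples.pvar 2 > samples_alt2.pvar
```

With the multiallelic example from this thread (REF TA, ALT TAA,T), index 1 gives 1:1234:TA:TAA and index 2 gives 1:1234:T:TA.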