--score for polygenic risk score with PLINK 2.0

Elayna Kirsch

Sep 4, 2018, 1:41:23 PM
to plink2-users
I was wondering whether the --score function works differently in PLINK 2.0 than in PLINK 1.9, given their respective binary files (.pgen, .pvar, .psam vs. .bed, .bim, .fam). I have a text file called TC_aligned.txt, formatted as a column of SNPs, a column of risk alleles, and a column of beta values. I ran quality-control steps on the merged 1000 Genomes phase 3 dataset, filtering on missingness of SNPs and individuals (--geno and --mind), minor allele frequency (--maf), and Hardy-Weinberg equilibrium (--hwe). When I run ./plink2 --pfile vzs all_phase3_8 --score TG_aligned.txt --out TG, it creates a TG.sscore file with columns #IID, SuperPop, Population, NMISS_ALLELE_CT, NAMED_ALLELE_DOSAGE_SUM, and SCORE1_AVG, where SCORE1_AVG is either 0, 0.035, or 0.07.

This is different from my output using PLINK 1.9 with a different reference dataset, which produced a file with FID, IID, PHENO, CNT, CNT2, and SCORE columns. Additionally, the TG.log file shows "--score: 1 variant processed" even though 14272710 variants were loaded from the all_phase3.pvar.zst file. I would appreciate any advice!

Christopher Chang

Sep 4, 2018, 3:05:47 PM
to plink2-users
Two possibilities come to mind re: "--score: 1 variant processed":
1. TG_aligned.txt contains Classic-Mac linebreaks which are not understood by plink, so only the first line of the file was processed properly.
2. TG_aligned.txt doesn't use the same variant IDs as your plink2 files.  Take a look at the first 10 nonheader lines of your .pvar file (decompressing first if necessary), and see if they appear to be using the same naming scheme as TG_aligned.txt.
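
Both possibilities can be checked from the shell before rerunning --score. The following is a minimal sketch, not part of PLINK: the helper names count_cr, count_lf, and first_ids are hypothetical, first_ids assumes the usual #CHROM POS ID REF ALT column order in the .pvar, and the commented usage assumes the zstd tool is installed.

```shell
# count_cr FILE: print the number of carriage-return (\r) bytes in FILE.
# A nonzero count with few or no \n bytes suggests Classic-Mac linebreaks.
count_cr() {
  tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# count_lf FILE: print the number of linefeed (\n) bytes in FILE.
count_lf() {
  tr -cd '\n' < "$1" | wc -c | tr -d ' '
}

# first_ids N: read .pvar text on stdin, skip header lines starting with '#',
# and print the ID column (assumed to be column 3) of the first N variant lines.
first_ids() {
  awk -v n="$1" '!/^#/ { print $3; if (++seen >= n) exit }'
}

# Example usage (file names taken from this thread):
#   count_cr TG_aligned.txt; count_lf TG_aligned.txt
#   zstd -dc all_phase3.pvar.zst | first_ids 10
```

If count_cr is large and count_lf is zero or near zero, the file is probably being read as a single line.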

Elayna Kirsch

Sep 4, 2018, 3:58:08 PM
to plink2-users
I don't believe it's issue #1, because PLINK 1.9 worked with this file to compute polygenic risk scores with a different genetic dataset, unless PLINK 2.0 has different requirements for line breaks. Both files use variant IDs in the format rs#####. Viewing the decompressed .pvar file in the terminal or in a text editor shows what the file looks like (images attached). Does this look like a normal .pvar file?
fullsizeoutput_2dac.jpeg
fullsizeoutput_2dae.jpeg

Christopher Chang

Sep 4, 2018, 4:07:39 PM
to plink2-users
Looks normal enough.  (And no, plink 2.0 doesn't have different requirements for line breaks.)  Can you construct a --score file with just two entries corresponding to variants in the .pvar file, and send me the .log file of a run where --score reports only 1 variant processed?

Elayna Kirsch

Sep 4, 2018, 5:32:27 PM
to plink2-users
I created a fake .txt file with two variants that were both in the all_phase3_9.pvar file and still only one SNP was processed. Attached is the log file of one variant processed.
sample.log

Christopher Chang

Sep 4, 2018, 5:34:29 PM
to plink2-users
Okay, can you send me that sample_GRS.txt so I can try to reproduce this on my copy of 1000 Genomes phase 3?

Elayna Kirsch

Sep 4, 2018, 5:36:14 PM
to plink2-users
Yep, here it is. 
sample_GRS.txt

Christopher Chang

Sep 4, 2018, 6:08:28 PM
to plink2-users
Hmm, now I'm stumped; I've tried to replicate this problem on both my Mac and a Linux test machine using the 64-bit 30 Aug build and the same memory/threads settings, but I'm always getting 2 variants processed.  I will retry with VirtualBox running FreeBSD when I get home tonight.

Christopher Chang

Sep 4, 2018, 6:21:10 PM
to plink2-users
Meanwhile, if you run --extract + --make-pgen on those two variant IDs instead of --score, do 1 or 2 variants remain?
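
One way to set up that --extract test is to pull the variant IDs out of the --score file first. This is a rough sketch: score_ids is a hypothetical helper, the output names sample_ids.txt and extract_test are arbitrary, and the plink2 invocation in the comments simply mirrors the options shown in this thread's logs and is not run by the block itself.

```shell
# score_ids FILE: print the first whitespace-delimited column (the variant ID)
# of each nonblank line of a --score file.
score_ids() {
  awk 'NF { print $1 }' "$1"
}

# Example usage (requires plink2 and the files from this thread):
#   score_ids sample_GRS.txt > sample_ids.txt
#   ./plink2 --pfile vzs all_phase3_9 --extract sample_ids.txt --make-pgen --out extract_test
```

The log of the second command reports how many variants remain after --extract, which can be compared with the count that --score reports.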

Elayna Kirsch

Sep 5, 2018, 8:44:47 AM
to plink2-users
I will try the extract and make-pgen and will let you know. Thank you!

Elayna Kirsch

Sep 5, 2018, 9:15:12 AM
to plink2-users
Did you edit the copy of the phase 3 data at all? Or is it the files directly downloaded from the PLINK 2.0 Resource page?

Elayna Kirsch

Sep 5, 2018, 9:45:24 AM
to plink2-users
When I run --extract and --make-pgen, it finds 2 variants. I will try to redo the --score run on my Mac and see what happens.

Elayna Kirsch

Sep 5, 2018, 11:02:20 AM
to plink2-users
Okay, for one file it seemed to work, but the other text files with variants I am trying to use give me this error:

11112370 variants loaded from all_phase3_9.pvar.zst.
2 categorical phenotypes loaded.

Calculating allele frequencies... done.

Error: No valid variants in --score file.

End time: Wed Sep  5 10:26:29 2018


I believe it probably has something to do with the way the text files are formatted, even though the file that worked and the files that are giving me errors should be formatted the same way. Attached are the TC_aligned file that is working and the TG_aligned file that is not. If you see any clear differences or reasons why one is not compatible with PLINK, please let me know.
TC_aligned.txt
TG_aligned.txt

Christopher Chang

Sep 5, 2018, 11:54:50 AM
to plink2-users
Hmm, very strange.  I'm getting the same top-level result of TC_aligned.txt working and TG_aligned.txt not working (using all_phase3.pgen filtered with --maf 0.01) on my Mac, but the reasons are different:

PLINK v2.00a2 SSE4.2 (4 Sep 2018)              www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pfile vzs test1
  --score TC_aligned.txt

Start time: Wed Sep  5 08:51:34 2018
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
2504 samples (1271 females, 1233 males; 2497 founders) loaded from test1.psam.
14273419 variants loaded from test1.pvar.zst.
2 categorical phenotypes loaded.
Calculating allele frequencies... done.
Warning: 63 --score file entries were skipped due to missing variant IDs, and 3
were skipped due to mismatching allele codes.
(Add the 'list-variants' modifier to see which variants were actually used for
scoring.)
--score: 644 variants processed.
--score: Results written to plink2.sscore .

End time: Wed Sep  5 08:51:41 2018


PLINK v2.00a2 SSE4.2 (4 Sep 2018)              www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pfile vzs test1
  --score TG_aligned.txt

Start time: Wed Sep  5 08:50:58 2018
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
2504 samples (1271 females, 1233 males; 2497 founders) loaded from test1.psam.
14273419 variants loaded from test1.pvar.zst.
2 categorical phenotypes loaded.
Calculating allele frequencies... done.

Error: Variant ID 'rs12047226' appears multiple times in --score file.


Can you post the full .log files from your two runs?
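
A duplicate-ID error like the one in the second log can be checked independently of PLINK by listing IDs that occur more than once in the --score file. This is a small sketch; dup_ids is a hypothetical helper, and it assumes the file uses Unix linebreaks with the variant ID in the first column.

```shell
# dup_ids FILE: print each variant ID (first column) that occurs more than
# once in FILE, one ID per line. Blank lines are ignored.
dup_ids() {
  awk 'NF { print $1 }' "$1" | sort | uniq -d
}

# Example usage (file name taken from this thread):
#   dup_ids TG_aligned.txt
```

An empty result means no ID is repeated under those assumptions.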

Elayna Kirsch

Sep 5, 2018, 1:21:41 PM
to plink2-users
Very odd. Now when I run the TG_aligned file this is my log output:

PLINK v2.00a2 AVX2 (26 Aug 2018)
Options in effect:
  --out TG
  --pfile vzs all_phase3_9
  --score TG_aligned.txt header list-variants

Hostname: Laynies-MBP.wireless.yale.internal
Working directory: /Users/layniekirsch/Desktop/1kg_phase_3
Start time: Wed Sep  5 13:18:11 2018

Random number seed: 1536167891
8192 MiB RAM detected; reserving 4096 MiB for main workspace.
Using up to 4 compute threads.
2504 samples (1271 females, 1233 males; 2497 founders) loaded from
all_phase3_9.psam.
11112370 variants loaded from all_phase3_9.pvar.zst.
2 categorical phenotypes loaded.
Calculating allele frequencies... done.
Error: Empty --score file.

End time: Wed Sep  5 13:18:24 2018

When I re-run the TC_aligned file, this is my output:

PLINK v2.00a2 AVX2 (26 Aug 2018)
Options in effect:
  --out TC
  --pfile vzs all_phase3_9
  --score TC_aligned.txt header list-variants

Hostname: Laynies-MBP.wireless.yale.internal
Working directory: /Users/layniekirsch/Desktop/1kg_phase_3
Start time: Wed Sep  5 13:19:55 2018

Random number seed: 1536167995
8192 MiB RAM detected; reserving 4096 MiB for main workspace.
Using up to 4 compute threads.
2504 samples (1271 females, 1233 males; 2497 founders) loaded from
all_phase3_9.psam.
11112370 variants loaded from all_phase3_9.pvar.zst.
2 categorical phenotypes loaded.
Calculating allele frequencies... done.
Warning: 272 --score file entries were skipped due to missing variant IDs, and
2 were skipped due to mismatching allele codes.
--score: 435 variants processed.
Variant list written to TC.sscore.vars .
--score: Results written to TC.sscore .

End time: Wed Sep  5 13:20:11 2018

Christopher Chang

Sep 5, 2018, 1:37:07 PM
to plink2-users
I suspect there's some sort of uninitialized-variable bug that is causing (i) our results to diverge, and (ii) your results to vary a bit between runs, though I'm still unable to reproduce the problem on my Mac after copying the --memory/--threads/--score modifier settings.

Can you see if adding "--threads 1" makes a difference?  Meanwhile, I will prepare a debug build for you to use.

Elayna Kirsch

Sep 5, 2018, 1:58:01 PM
to plink2-users
I think I found the reason why it is not working. When I look at the TC_aligned file (which works) in the terminal, it looks like 3 separated columns:

Laynies-MBP:1kg_phase_3 layniekirsch$ more TC_aligned.txt
rs2290547       G       0.008
rs79949326      T       -0.01
rs2737203       C       -0.009
rs7115089       G       -0.014
rs10468017      T       -0.05
rs1699337       G       0.023
rs28709068      G       0.041
rs1998013       C       0.1


However, when I do the same for the TG_aligned.txt file it looks like this, even though when opened in the text editor it looks normal:

Laynies-MBP:1kg_phase_3 layniekirsch$ more TG_aligned.txt

rs2290547       A       -0.007rs79949326        C       0.008^Mrs2737203        C       -0.002^Mrs7115089       C       0.006^Mrs10468017       T       -0.034^M^Mrs1699337       G       0.01^Mrs28709068        A       -0.003^Mrs1998013

       T       -0.019^Mrs838880        C       0.013^Mrs1610095        G       0.007^Mrs5756931        T       0.027^Mrs1030431        A       0.017^Mrs10102164


These files were created the exact same way. Do you have any knowledge of how to fix the format of the second file to look like the first?

Thank you!

Christopher Chang

Sep 5, 2018, 2:06:05 PM
to plink2-users
So this is the Classic-Mac linebreaks problem after all.  A quick solution is "cat TG_aligned.txt | tr '\r' '\n' > TG_aligned_fixed.txt".  To avoid generating this type of file in the future, you'll probably need to change a linebreak-saving setting in the text editor you're using.

(But there's something I don't understand: the TG_aligned.txt file you posted did not have this problem.  Maybe the Google Groups infrastructure is automatically translating the linebreaks?)
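
The tr one-liner can be wrapped in a slightly more defensive form. The sketch below is only an illustration: normalize_linebreaks is a hypothetical helper that turns every carriage return into a linefeed and then drops the blank lines this can create (for example from the doubled ^M^M visible in the `more` output earlier in the thread, or from Windows-style \r\n pairs). It assumes blank lines carry no meaning in a --score file.

```shell
# normalize_linebreaks FILE: write FILE to stdout with every carriage return
# replaced by a linefeed, then remove empty or whitespace-only lines.
normalize_linebreaks() {
  tr '\r' '\n' < "$1" | awk 'NF'
}

# Example usage (file names taken from this thread):
#   normalize_linebreaks TG_aligned.txt > TG_aligned_fixed.txt
```

Writing to a new file, as in the original command, keeps the untouched input available for comparison.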

Elayna Kirsch

Sep 5, 2018, 2:10:08 PM
to plink2-users
I am not sure why the file I posted worked after I uploaded it to Google Groups, but once I fixed the linebreaks in the terminal, --score in PLINK worked. Thank you for all of your help!