Mismatched SNPs question


Jeremy

Jun 15, 2018, 7:51:14 PM
to PRSice
Hello,

I've been able to get PRSice running and have been able to produce polygenic risk scores that closely match scores we have generated with other approaches. While I've been able to do this using a few different base files, there is one I'm working on now that gives me an unusually large number of mismatched SNPs. I have attached the log file.

The genome build is the same between base and target. I've also loaded the SNP, CHR, POS, A1 and A2 from both base and target into R to do some exploring. First, I removed indels and ambiguous SNPs. I find that the positions all match, and when I allow for strand flips or swapped effect alleles, all of the remaining SNPs (6408876) seem to match. I've tried running PRSice with and without specifying chr and pos (specifying --no-default when not including these). Any ideas what else could be causing the mismatch?
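For readers who want to repeat this kind of check, here is a minimal sketch in Python of comparing allele pairs while allowing for strand flips and swapped alleles. The function and label names are illustrative only, and it assumes indels and ambiguous SNPs have already been removed, as described above. It is not PRSice's own matching code.

```python
# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(base_a1, base_a2, target_a1, target_a2):
    """Classify how a base allele pair relates to a target allele pair.

    Assumes single-nucleotide alleles (indels removed beforehand).
    """
    base = (base_a1.upper(), base_a2.upper())
    target = (target_a1.upper(), target_a2.upper())
    if base == target:
        return "match"
    if base == target[::-1]:
        return "swapped"
    # Re-express the target alleles on the opposite strand.
    flipped = (COMPLEMENT[target[0]], COMPLEMENT[target[1]])
    if base == flipped:
        return "strand_flip"
    if base == flipped[::-1]:
        return "strand_flip_and_swapped"
    return "mismatch"


print(classify("A", "G", "G", "A"))  # swapped
print(classify("A", "G", "T", "C"))  # strand_flip
```

Counting the "mismatch" label over all shared SNPs gives the number of variants that cannot be reconciled by flipping or swapping.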

Thank you for your help,
Jeremy 
PRSice.log

Sam Choi

Jun 18, 2018, 11:33:12 AM
to PRSice
One possibility is mismatched alleles between the base, target and reference genotypes. I might need to take a deeper look to identify that. If PRSice cannot "match" the alleles by flipping, the SNP is still considered mismatched.


Jeremy

Jun 18, 2018, 12:09:01 PM
to PRSice
Hi Sam,

Thanks for the reply. I've attached a small test subset that seems to demonstrate the problem. This includes the base summary stats, and then a subset of SNPs from the .bim files of the target and reference.

The base dataset has 7 SNPs and the target has 4 SNPs. It looks like 4 variants are not found in the target dataset, which is correct (3 are not there, and 1 is an ambiguous SNP which would have been filtered out). This leaves 2 SNPs, and then rs3094315 is identified as a mismatched variant and excluded. The position and alleles look like they should be OK, though: the base and target are both A/G and the reference is G/A. Are there any other things I should check for that could be causing the mismatch?

Thanks again,
Jeremy
base_subset.txt
reference_subset.txt
target_subset.txt
test_PRScise.log
test_PRScise.snp

Sam Choi

Jun 18, 2018, 6:36:29 PM
to PRSice
Thanks, this will actually be very useful. I will look into it tomorrow.

Jeremy

Jul 9, 2018, 7:25:35 PM
to PRSice
Hi Sam,

I just wanted to follow-up and see if you were able to find anything out regarding this issue?

Thanks,
Jeremy

Sam Choi

Jul 10, 2018, 7:19:21 PM
to PRSice
Oh, no... sorry, I completely forgot about this. I am in the process of fine-tuning the code. Let me have a look once I fix my current bug...

Sam Choi

Jul 10, 2018, 8:53:39 PM
to PRSice
Just checked using my simulated data (substituted with your files). rs3094315 is actually considered matched in my test run. Is it possible that it was removed by the QC filters you specified?

E.g. the MAF or genotype missingness filters


Jeremy

Jul 11, 2018, 12:58:11 PM
to PRSice
Hi Sam,

Sorry, I made a mistake earlier. Of the 7 SNPs in the base dataset, 3 non-ambiguous SNPs are found in the target dataset as expected (rs143225517, rs3094315, rs58949655). It turns out that rs58949655 is the mismatch, leaving two remaining. Then rs3094315 is filtered out due to LD clumping as expected. rs58949655 seems like it should pass standard QC thresholds in the base, target and reference datasets. However, I have found that even when removing the QC arguments (and specifying --no-default), it is still identified as a mismatch.

Thanks,
Jeremy

Sam Choi

Jul 23, 2018, 7:13:21 AM
to PRSice
I've now released the latest update, and it should generate a file called .mismatch which we can then use to see why there's such a problem. Maybe have a go with that?

Jeremy

Jul 30, 2018, 7:40:51 PM
to PRSice
Hi Sam,

I created a set of plink binary files for the testing dataset with only the subset of testing SNPs from before. After doing this, the attached mismatch file shows that both rs3094315 and rs58949655 are mismatches. It looks like the effect and non-effect alleles are swapped between the target and base files. I had thought that these cases were appropriately swapped when doing the scoring, and so could be identified as matches, but maybe not? Let me know if I can provide any other information.

Thanks,
Jeremy
target_plink.bim
base_subset.txt
test_PRScise.mismatch
test_PRScise.log

Sam Choi

Jul 31, 2018, 4:37:28 PM
to PRSice
Thanks, this showed me that the mismatch file has one misplaced \t.

Back to the problem of mismatch:

The main problem here is that your base file contains only A1, not A2. When your target file contains A1=G and A2=A and your base file contains A1=A, PRSice does the following:

1. Does A1 in the base equal A1 in the target? No, go to 2
2. Does the complement of A1 in the base equal A1 in the target? No, go to 3
3. Do we have A2 for both base and target? No, so this is a mismatch

I guess we could always just flip and assume that A1 in your base is actually A2 in the target. I can't remember exactly why I didn't implement that logic; maybe it had something to do with not feeling safe to flip when there isn't information on both alleles. I will think about it, and if I can't think of any reason against it, I will implement this logic in the next interim update.
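The three checks listed above can be sketched in Python as follows. This is only an illustration of the described decision path, not PRSice's actual source. The function name is hypothetical, and the branch taken when A2 is available in both files is an assumption, since the thread does not describe it.

```python
# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def a1_matches(base_a1, target_a1, base_a2=None, target_a2=None):
    """Return True if the base SNP can be matched to the target SNP."""
    base_a1, target_a1 = base_a1.upper(), target_a1.upper()
    # Step 1: identical effect allele.
    if base_a1 == target_a1:
        return True
    # Step 2: effect allele matches after taking the complement.
    if COMPLEMENT.get(base_a1) == target_a1:
        return True
    # Step 3: without A2 on both sides there is nothing left to compare.
    if base_a2 is None or target_a2 is None:
        return False
    # ASSUMPTION (not described in the thread): with both alleles known,
    # accept a swap, either directly or on the opposite strand.
    base_a2, target_a2 = base_a2.upper(), target_a2.upper()
    if (base_a1, base_a2) == (target_a2, target_a1):
        return True
    return (COMPLEMENT.get(base_a1), COMPLEMENT.get(base_a2)) == (target_a2, target_a1)


# The case from this thread: base has A1=A only, target has A1=G, A2=A.
print(a1_matches("A", "G"))            # False: reported as a mismatch
print(a1_matches("A", "G", "G", "A"))  # True under the swap assumption
```

Under these assumptions, adding an A2 column to the base file is what lets the swapped SNPs be recognised.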

Tomas Keller

Aug 24, 2018, 11:12:08 AM
to PRSice
Hi Sam, 

I have a similar issue. I simply split my PLINK data into two sets (ca. 900 subjects), so I would expect the alleles to be well matched between the base and target data sets.
However, 876505 ambiguous variant(s) are still found out of 5621888 variants. See my log file for details.

Thank you very much for the help.

All the best
Tom
PRSice.log

Sam Choi

Aug 24, 2018, 3:09:08 PM
to PRSice
Splitting the file in half has nothing to do with ambiguous variants. Any variant whose alleles are A/T or G/C will be removed. If you want to keep them, you can use --keep-ambig

Tomas Keller

Aug 24, 2018, 5:47:18 PM
to PRSice
I have just double checked and the alleles are well matched (Both A1, A2 are consistent) between base and target data set.
However, 876505 ambiguous variant(s) are still found out of 5621888 variants. See my previous log file for details.
Many thanks.

Sam Choi

Aug 24, 2018, 7:12:48 PM
to PRSice
Ambiguous SNPs have nothing to do with matching. If a SNP is A and T in your target, or G and C, we will remove it regardless of whether the same encoding is used in the base. To disable this behavior, you need to use --keep-ambig
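As a small illustration of the rule just described, a SNP is ambiguous when its two alleles are complements of each other, so a strand flip cannot be told apart from an allele swap. A minimal Python sketch follows; the function name is hypothetical and this is not PRSice's code.

```python
# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True for A/T, T/A, C/G and G/C allele pairs."""
    # .get() returns None for anything that is not a single base
    # (e.g. an indel), so such variants are never flagged here.
    return COMPLEMENT.get(a1.upper()) == a2.upper()


print(is_ambiguous("A", "T"))  # True
print(is_ambiguous("A", "G"))  # False
```

The check looks only at the alleles within one file, which is why identical coding in base and target does not change the count.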

Tomas Keller

Aug 26, 2018, 4:19:42 PM
to PRSice
Thank you so much, 
-Tomas

Tomas Keller

Sep 2, 2018, 5:03:08 AM
to PRSice
Hi Sam, 

I wonder whether '--cov-factor' is disabled? I did include "--cov-factor sex", and "sex" is already in my cov.txt file.
See the following warning in the attached log file:
--------------
Warning: Covariate(s) missing from file: sex. Header of 
         file is: FID IID PC1 PC2 PC3 PC4 PC5 PC6 PC9 PC10 sex
--------------
Thank you very much for your information.
Best,
Tom

PRSice2.log

Sam Choi

Sep 2, 2018, 6:04:00 PM
to PRSice
You might want to check whether your header contains a special character at the end. Did you work on Windows, or was your phenotype / covariate file generated on a Windows machine? Those files tend to have a strange special character at the end of each line (^M), and for some reason I haven't been able to remove them.
 

Tomas Keller

Sep 3, 2018, 11:49:28 AM
to PRSice
Hi Sam, 
No, the cov file was generated on a Linux machine, and there is no special character at the end. I also tried both integer and float values, but it did not work.

Sam Choi

Sep 4, 2018, 7:12:09 AM
to PRSice
I might need the first few lines of your covariate file to check what's the problem.

If you don't mind, could you send me the "head" of your covariate file?

Alternatively, you can add an extra column to your file using R and see if the error goes away 

Tomas Keller

Sep 5, 2018, 6:08:17 AM
to PRSice
Hi Sam, 

Please find the covariate file in the attachment. 
Adding an extra column using R however did not solve the problem.

Thank you!

header.txt

Sam Choi

Sep 5, 2018, 6:02:07 PM
to PRSice
Got it: I forgot to remove the end-of-line character from the string when parsing the covariate file header.

I have now pushed an interim update. You should be able to use that without problems.
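To illustrate this class of bug (this is a Python sketch with made-up variable names, not PRSice's actual C++ code): if the header line is split on the delimiter without first stripping the line ending, the last column name keeps the trailing newline and no longer equals "sex".

```python
def parse_header(line):
    """Split a whitespace-delimited header after removing the line ending."""
    return line.rstrip("\r\n").split()


raw = "FID IID PC1 PC2 sex\n"

naive = raw.split(" ")           # last field is 'sex\n'
print("sex" in naive)            # False: covariate reported as missing
print("sex" in parse_header(raw))  # True
```

Stripping both `\r` and `\n` also covers files written with Windows line endings.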

Tomas Keller

Sep 6, 2018, 3:53:07 AM
to PRSice
Thank you very much.

Tomas Keller

Nov 29, 2018, 1:24:18 PM
to PRSice
Dear Sam, 

I'm wondering why the score in the file "PRSice.best" is exactly the same even if I specified "cov-file" differently as below
1. with 20 PCs and sex >>>  (with --cov-col @PC[1-10],sex and --cov-factor sex \)
2. with 20 PCs >>>  (with --cov-col @PC[1-10] \)
3. no 'cov-file' 
But interestingly, "PRS.R2" in the file "PRSice.summary" is not the same. Often, R2 is lowest when adjusting for the covariates.

Thank you very much for your input!

Sam Choi

Nov 29, 2018, 5:31:19 PM
to PRSice
That’s because the calculation of the PRS is independent of the covariates. The covariates are used only in the regression analysis.
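A minimal sketch of this point on simulated data, using NumPy. All numbers and variable names are made up for illustration. It assumes the score is a weighted sum of allele counts and that the PRS R2 in an adjusted analysis is the full-model R2 minus the covariate-only R2; the exact quantities PRSice reports may differ.

```python
import numpy as np


def r_squared(X, y):
    """R2 of an ordinary least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid.var() / y.var()


rng = np.random.default_rng(0)
n, m = 500, 50
genotypes = rng.integers(0, 3, size=(n, m)).astype(float)  # allele counts 0/1/2
effects = rng.normal(0.0, 0.1, size=m)                     # base effect sizes
sex = rng.integers(0, 2, size=n).astype(float)

# The score uses only genotypes and effect sizes, never the covariates.
prs = genotypes @ effects

pheno = prs + 0.5 * sex + rng.normal(0.0, 1.0, size=n)

# The covariates enter only at the regression step.
r2_null = r_squared(sex, pheno)                          # covariates only
r2_full = r_squared(np.column_stack([sex, prs]), pheno)  # covariates + PRS
prs_r2_adjusted = r2_full - r2_null
prs_r2_unadjusted = r_squared(prs, pheno)

print(prs_r2_adjusted, prs_r2_unadjusted)
```

The `prs` vector is identical in both analyses; only the R2 attributed to it changes with the covariates included in the regression.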

Tomas Keller

Nov 30, 2018, 3:56:52 AM
to PRSice
Thank you! 
If I use this "PRS score" for other association studies, do I need to adjust for the covariates of interest again, such as PCs and sex?

Sam Choi

Nov 30, 2018, 7:39:41 PM
to PRSice
Yes