PRS calculation


Priyadarshini Thirunavukkarasu

Aug 15, 2017, 3:30:40 PM
to PRSice
Hi,
I have been reading papers to understand how PRSice calculates PRS. In the GWAS dataset, the odds ratio is already calculated for each SNP; using these odds ratios, we then calculate a PRS for each subject in the GWAS dataset, and in the target phenotype we predict using the PRS. Is that right? Or do we use data from the target phenotype to calculate the PRS in the base phenotype? I am confused about this part, please clarify.
Thank you
Priya

Sam Choi

Aug 16, 2017, 6:56:27 PM
to PRSice
So the procedure in PRSice, which is basically the standard PRS procedure, is:

1. Obtain the OR or BETA for each SNP from the GWAS
2. In the target sample (independent of the GWAS), multiply the BETA (or the log-transformed OR) by the number of risk alleles each individual carries
    - This generates the PRS

The problem is that we don't know which SNPs to include in the analysis, so we construct the PRS with SNPs under different P-value thresholds.
This gives us a large number of PRS.

Then, to identify the "best" PRS, we regress the phenotype of the target samples on each PRS.

The PRS that best predicts the phenotype (with the largest R2) is considered the "best".
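
The procedure above can be sketched in a few lines of Python. This is only an illustration on invented toy data, not PRSice's implementation: the function names (`prs_at_threshold`, `r_squared`, `best_threshold`) are hypothetical, the score is assumed to be a plain sum of BETA times risk-allele count, and R2 is taken from a simple linear regression with no covariates.

```python
import numpy as np

def prs_at_threshold(dosages, betas, pvalues, threshold):
    """Sum of BETA * risk-allele count over SNPs with GWAS p-value <= threshold."""
    keep = pvalues <= threshold
    return dosages[:, keep] @ betas[keep]

def r_squared(score, phenotype):
    """R2 of a simple linear regression of the phenotype on one score."""
    if np.std(score) == 0:
        return 0.0
    return float(np.corrcoef(score, phenotype)[0, 1] ** 2)

def best_threshold(dosages, betas, pvalues, phenotype, thresholds):
    """Return (threshold, R2) for the PRS that best predicts the phenotype."""
    results = [
        (t, r_squared(prs_at_threshold(dosages, betas, pvalues, t), phenotype))
        for t in thresholds
    ]
    return max(results, key=lambda pair: pair[1])

# Toy data, invented purely for illustration.
rng = np.random.default_rng(0)
n_samples, n_snps = 200, 50
dosages = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)

# Assumption: 5 SNPs carry a real effect; the rest have spurious GWAS estimates.
true_betas = np.zeros(n_snps)
true_betas[:5] = 0.5
gwas_betas = true_betas.copy()
gwas_betas[5:] = np.where(np.arange(n_snps - 5) % 2 == 0, 0.3, -0.3)
pvalues = np.concatenate([np.full(5, 1e-6), np.linspace(0.01, 1.0, n_snps - 5)])

phenotype = dosages @ true_betas + rng.normal(0, 0.5, n_samples)

best_t, best_r2 = best_threshold(
    dosages, gwas_betas, pvalues, phenotype, thresholds=[1e-5, 0.05, 1.0]
)
print(best_t, round(best_r2, 3))
```

In this toy setup the threshold that admits every SNP dilutes the score with noise, so a stricter threshold wins. Real analyses also involve clumping, covariates and a much finer grid of thresholds.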

lee.s...@ada.com

Feb 5, 2018, 9:36:35 AM
to PRSice
Hi Sam, 

Thanks for putting together the resource. It is really cool and I am eager to use it.

I wanted to ask how many target samples are necessary to produce a somewhat reliable PRS. Do all target samples have the same phenotype? 

I also want to ask, do you manipulate the OR/BETA in any way in PRSice? For example, if the odds ratio for allele 1 is X, then the odds ratio for allele 2 is 1/X. So if you took into account both alleles without manipulation, the heterozygote would always have a risk. Do you in any way take allele frequencies into account? 

Thanks, 

Lee

lee.s...@ada.com

Feb 5, 2018, 10:50:16 AM
to PRSice
I need to make a correction -- I wanted to say that the heterozygote would have a risk of 1. 

Sam Choi

Feb 5, 2018, 7:10:59 PM
to PRSice
Sorry, I don't think I completely understand what you meant.

Currently, we follow PLINK's way of calculating the PRS.

For odds ratios, we first log-transform them into BETAs.

We then simply multiply the BETA by the allele dosage.

So, for example, take a BETA of X;
the corresponding PRS generated will be

0/0 = 0
0/1 = 0.5 * X
1/1 = X

We also support other models (e.g. dominant, recessive, and heterozygous). You can refer to our documentation for more info.

Please let me know if I have understood your question correctly.
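
As a rough illustration of the 0 / 0.5 * X / X pattern described above (a sketch only, not PRSice's or PLINK's actual code; `snp_score` and `polygenic_score` are made-up names, and summing per-SNP terms is an assumption of this example):

```python
import math

def snp_score(odds_ratio, effect_allele_count):
    """Per-SNP contribution: log(OR) times the allele count, halved per allele."""
    beta = math.log(odds_ratio)  # "X" in the message above
    return beta * effect_allele_count / 2

def polygenic_score(odds_ratios, effect_allele_counts):
    """Combine per-SNP contributions for one individual (plain sum, by assumption)."""
    return sum(snp_score(o, c) for o, c in zip(odds_ratios, effect_allele_counts))

# One SNP with OR = 2 for the effect allele, for the three possible genotypes.
for count in (0, 1, 2):
    print(count, round(snp_score(2.0, count), 3))
```

An OR of exactly 1 contributes nothing to the score whatever the genotype, because log(1) = 0.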

lee.s...@ada.com

Feb 8, 2018, 11:09:20 AM
to PRSice
OK let me clarify. If you have the odds ratio for the 'risk allele', then it will be relative to the 'reference allele'. So the odds ratio for the 'reference allele' is the reciprocal of that for the 'risk allele'. 

ORrisk = ODDSrisk / ODDSreference

ORreference = ODDSreference / ODDSrisk

Therefore, if you simply multiply the ORrisk by the allele dosage (AD), you fail to account for the reference AD and ORreference. Shouldn't you also have an AD for the reference allele? 

There is then an issue if you also take into account the ORreference. For a heterozygote you will have ADrisk * ORrisk * ADreference * ORreference = 1. This is actually incorrect, because it fails to account for necessary information regarding that SNP. Each odds ratio is allelic and, because we are diploid, needs to be corrected to create an allelic odds ratio specific to that SNP that takes into account the frequencies of the alleles at that SNP.

Maybe there is something I am missing from the PLINK calculation. I will have a look at the documentation, could you give me a link to their method if you have one?

Anyways, this question was actually secondary to the primary question I wanted to ask, which was how many target samples are necessary to produce a somewhat reliable PRS? Do all target samples have to have the same phenotype? 

Sam Choi

Feb 9, 2018, 10:43:57 AM
to PRSice
I guess the easiest way to explain this is that the PRS is calculated with respect to the risk allele or, more precisely, the effective allele.

So, for example, if my effective allele has an OR of 2, then a sample with 0 copies of this allele should have a score of 0, a sample with 1 copy should have a score of 0.347, and a sample with 2 copies should have a score of 0.693 (the natural log of the OR).

Now, what happens to the non-effective allele? We can write it down as follows:

Allele type   | Effect of 0 copies | Effect of 1 copy | Effect of 2 copies |
Effective     | 0                  | 0.347            | 0.693              |
Non-effective | 0.693              | 0.347            | 0                  |

This is because when you have 0 copies of the non-effective allele, you must have 2 copies of the effective allele, and so on.

So you should never have a risk of 1 for heterozygous, because you will only assume one of the allele has an effect and calculate the PRS accordingly. This is also why it is important for users to specify the --A1 and --A2 columns correctly; otherwise the effect will be assigned to the wrong allele, causing problems in the PRS calculation.

Hope I have answered your question.
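
To see why the allele labels matter, here is a small sketch (hypothetical helper `score`, same halved log(OR) scoring as in the table above). Scoring allele B with the inverted OR only shifts every genotype's score by a constant, so the ordering of individuals is preserved. Scoring allele B with allele A's OR, i.e. swapping the allele columns without inverting the effect, reverses the ordering.

```python
import math

def score(copies_of_scored_allele, odds_ratio):
    """Halved log(OR) times the number of copies of whichever allele is scored."""
    return math.log(odds_ratio) * copies_of_scored_allele / 2

copies_a = [0, 1, 2]                    # copies of allele A (OR = 2)
copies_b = [2 - k for k in copies_a]    # diploid: copies of allele B

by_a = [score(k, 2.0) for k in copies_a]         # A scored with OR = 2
by_b = [score(k, 0.5) for k in copies_b]         # B scored with OR = 1/2
mislabelled = [score(k, 2.0) for k in copies_b]  # B scored with A's OR (wrong)

# Difference between the two correct codings, per genotype.
shifts = [a - b for a, b in zip(by_a, by_b)]
print([round(s, 3) for s in shifts])
print([round(m, 3) for m in mislabelled])
```

The constant shift is log(2) for every genotype, which is why the heterozygote never ends up with a cancelled-out effect: only one allele is scored at a time.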



lee.s...@ada.com

Feb 12, 2018, 6:33:00 AM
to PRSice
OK, so the main assumption is that only one allele has an effect? That is a pretty big assumption to make... 

Could you also answer my second question? How many target samples are necessary to produce a somewhat reliable PRS? Do all target samples have to have the same phenotype? 

Thanks for answering so promptly!

Lee

Sam Choi

Feb 12, 2018, 7:59:15 AM
to PRSice
No, we are not assuming that only one allele has an effect.

What is happening is that the "effect" is a relative term. For example, an OR of 2 for allele A suggests that it has 2 times the effect of allele B.

So when you are calculating the PRS, you are saying that:

Because this sample has two copies of allele A, it should have an effect of 2 compared to another sample that has two copies of allele B.




As for the sample size, that really depends on your trait, on the power of the summary statistics, etc.

(Just like in a normal regression analysis, there isn't a hard magic number; you will have to estimate the power based on your data.)

Basically, in PRSice, we use the target phenotype for "tuning" one parameter: the p-value threshold.

As long as you have sufficient samples with the target phenotype for the regression analysis, it is fine.

Or, if you don't have sufficient samples with the target phenotype, you can try the pseudo-validation proposed in this paper


lee.s...@ada.com

Feb 12, 2018, 9:52:08 AM
to PRSice
Exactly, both alleles have an effect, so why did you say: 

"So you should never have a risk of 1 for heterozygous, because you will only assume one of the allele has an effect and calculate the PRS accordingly"

Allele B will have the inverse effect of allele A, so if you use the normal effects, they will cancel out and equal 1 for the heterozygote.

I was under the impression that the odds ratio was allelic, for a single allele. The total number of alleles for a marker are summed across both chromosomes in the entire population. The difference in alleles between the control and test populations are then compared to get the effect. The effect is relative to each single allele, not two. If you assume that having two 'risk alleles' is equal to the odds ratio of the allele * allele dosage, then the individual with two copies of allele A is considered to have no effect, which would indicate the average risk of an individual in the population. However, this is NOT true because the alleles have some frequency in the population, and the risk of the average individual has to account for the risk alleles in the population as well. Therefore, an individual with two copies of allele A will not have a relative risk of 1, he will always have a lower relative risk. 

Sam Choi

Feb 12, 2018, 5:25:50 PM
to PRSice
You need to remember that when performing a GWAS, you are calculating the "relative" risk.

The effect size is calculated as the relative effect of allele A over allele B (so there isn't an absolute effect for allele A or B).

When you are calculating the PRS, someone with 2 copies of allele B should have 0 risk compared to other people with 2 copies of allele B, and someone with 2 copies of allele A will have X times higher/lower risk compared to someone with 2 copies of allele B.

So in summary, you must understand that in GWAS we do not have information on the absolute risk of each allele, but rather the relative risk of the effective allele (alternative allele) over the non-effective allele (reference allele).

lee.s...@ada.com

Feb 13, 2018, 5:49:06 AM
to PRSice
Maybe I am missing something big here. 

I understand that the odds ratio for A is a relative risk of A|B. You can also easily get the odds ratio for B as the relative risk of B|A. 

From my understanding, the odds ratio from most GWAS are expressed as SINGLE allele relative risk, not a diploid/genotypic relative risk. So if rs123 has allelic odds ratio of 2 for A, then the allelic odds ratio for B is 1/2. 

When you say that someone with 2 copies of B should have 0 risk compared to people with 2 copies of allele B, that is only possible if you have GENOTYPE-specific odds ratios (the relative risk of that individual will actually be equal to 1, not 0). This can be done if all the homozygous and heterozygous genotypes in the population are counted. Then the odds ratio will be expressed in terms of the homozygous 'healthy' genotype. Therefore, the heterozygote and the risk-allele homozygote can be compared to the healthy-allele homozygote, and the individual who is homozygous for the healthy allele will have a relative risk of 1, not 0. But in a standard GWAS this is not the output; the output is the single-allele odds ratio.

Paul O'Reilly

Feb 14, 2018, 5:32:39 AM
to PRSice
Hi Lee 

I think your final message makes clear the reason for the confusion between yourself and Sam. The odds ratios typically reported from GWAS on binary (e.g. case/control) outcomes relate to genotypes and not alleles. So if a SNP has genotypes CC / CT / TT, coded as 0 / 1 / 2, and the OR from the GWAS is 3, then the OR of CT relative to CC is 3, and that of TT relative to CC is 9 (the coding could have been done the other way, with TT as 0, CT as 1, CC as 2, and then the OR would have been reported as 0.33). So usually one of the homozygotes is the reference genotype (thus heterozygotes don't have an OR of 1 as you mentioned above, unless there happens to be equal odds across genotypes).

Let us know if that makes sense, and do ask if you have any other questions because no doubt this thread has been interesting for others to read. Thanks!

Paul  

lee.s...@ada.com

Feb 14, 2018, 6:28:34 AM
to PRSice
Hi Paul,

Thanks for the answer. This definitely makes sense and explains that I had made a mistake in my assumption about the GWAS output. As I have not actually done a GWAS and am using summary statistics from the GWAS Catalog, I wrongly assumed that the odds ratio given for the 'risk allele' was allelic and not genotypic. They made no clarification on their website, except to say that the odds ratio was associated with the risk allele, which implies that it is an allelic odds ratio.

Nonetheless, what is still confusing is why GWAS give only one odds ratio and assume an additive model (number of risk alleles*allele dosage) instead of giving separate odds ratios for the heterozygote and the homozygote risk genotypes, as you can also see in the Stanford presentation link that I shared in the last post.

Finally, I would like to continue to use your example of a SNP with genotypes CC | CT |TT, with T as the risk allele. For my own clarity, I would like to confirm that the odds ratio for TT is the OR^2, not 2*OR. 

I hope this little discussion proves to be useful for people in the future, because the genotypic/allelic odds ratios can be somewhat tricky. Even the concept of an odds ratio is difficult to understand statistically. 

Thanks again for your help!

Lee

lee.s...@ada.com

Feb 15, 2018, 5:39:40 AM
to PRSice
Hi Paul and Sam, 

I checked with the GWAS catalog, apparently: 

"Regarding the odd ratios, in the majority of cases we report allelic ORs; however, we extract the information from the paper just as it is reported by the authors. This means we might have genotypic odd ratios reported for some publications."

So in the end, we are probably both right to some degree, and the data that is coming out of the GWAS needs to be explicitly checked to be an allelic or genotypic odds ratio. In my opinion, each study should post all odds ratios: allelic in terms of risk allele, genotypic heterozygote for one copy of risk allele, and genotypic homozygote for two copies of risk allele. 

Thanks, 

Lee

Paul O'Reilly

Feb 15, 2018, 11:50:34 AM
to PRSice
Hi Lee

I'm quite surprised by the GWAS Catalog statement, because my strong guess would be that the majority of large GWAS published on binary outcomes have used logistic regression on genotypes (e.g. coded 0, 1, 2) rather than having calculated/reported allelic ORs. Perhaps when they say genotypic ORs they mean publications that estimated/reported an OR for each genotype separately, without assuming an additive model, which would be in the minority. If anyone has a contrary view on this then do shout up! In any case, under HWE allelic ORs and genotypic ORs are approximately equal. The potential effect of misspecification of ORs on the subsequent PRS is perhaps interesting theoretically, but I think the assumption of genotypic ORs that we make in PRSice is typically correct. It is worth noting, though, that this is indeed an assumption made by PRSice.
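
The HWE remark can be checked numerically in an idealised setting. The sketch below is an illustration, not part of PRSice: it assumes genotype frequencies ordered (CC, CT, TT), controls exactly in HWE, and a per-allele genotypic OR that is multiplicative across genotypes. The function names and the example frequencies are invented.

```python
def allelic_odds_ratio(cases, controls):
    """Allelic OR for T from genotype frequencies (or counts) ordered (CC, CT, TT)."""
    def t_allele_odds(cc, ct, tt):
        # Each TT carries two T alleles, each CT one; likewise for C.
        return (2 * tt + ct) / (2 * cc + ct)
    return t_allele_odds(*cases) / t_allele_odds(*controls)

def cases_from_controls(controls, per_allele_or):
    """Case genotype frequencies implied by a per-allele genotypic OR vs CC."""
    cc, ct, tt = controls
    weights = (cc, ct * per_allele_or, tt * per_allele_or ** 2)
    total = sum(weights)
    return tuple(w / total for w in weights)

# Controls in HWE with T-allele frequency p (assumed value).
p = 0.3
hwe_controls = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
or_hwe = allelic_odds_ratio(cases_from_controls(hwe_controls, 3.0), hwe_controls)

# Controls deliberately far from HWE (made-up frequencies).
non_hwe_controls = (0.5, 0.2, 0.3)
or_non_hwe = allelic_odds_ratio(
    cases_from_controls(non_hwe_controls, 3.0), non_hwe_controls
)

print(round(or_hwe, 6), round(or_non_hwe, 6))
```

With controls in HWE the allelic OR recovers the per-allele genotypic OR of 3; with the non-HWE controls it drifts well away from 3, which is the situation where the distinction between the two kinds of OR starts to matter.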

In terms of your question about how the ORs should be combined to get the homozygote OR, I could have been clearer in my example. My example ORs were those estimated in the GWAS (e.g. one that performed logistic regression). If the output is in terms of OR and suggests that CT has an (estimated) OR of 3, then the (estimated) OR of TT will be 3x3 = 9, due to the additive assumption made in logistic regression: additive on a log(OR) scale, multiplicative on an OR scale. If the output was in terms of log(OR), then the log(OR) of TT is 2*log(OR). So if the results were output in terms of log(OR), they would give 1.098612 in this example (the log(OR) of CT compared to CC), and so the log(OR) of TT (vs CC) would be 2*1.098612 = 2.197224 (e^2.197224 = 9).
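
The same arithmetic as a short Python check (variable names are illustrative only): under the additive logistic model the log(OR) doubles for the homozygote, which is the same as squaring the OR, not doubling it.

```python
import math

per_allele_or = 3.0                  # OR of CT relative to CC
log_or = math.log(per_allele_or)     # log(OR) of CT relative to CC

or_tt = per_allele_or ** 2           # multiplicative on the OR scale
log_or_tt = 2 * log_or               # additive on the log(OR) scale

print(round(log_or, 6))              # log(OR) for one copy of T
print(or_tt, round(math.exp(log_or_tt), 6))
```

Doubling the OR instead would give 6 for TT, which does not match what the logistic regression model implies.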

In terms of the target sample size required, that's a function of the power of the base GWAS data, the genetic architecture of the phenotype, and the genetic similarity between the base and target phenotypes if they differ. As a broad rule of thumb, you'll probably need at least a few hundred target individuals to get a significant association between the PRS and the same outcome in the target data, for a typical well-powered GWAS, and more like thousands if testing a different phenotype with moderate-to-high genetic correlation. This is all very approximate of course, but just to give you some idea from results that we've seen...

Paul
