SKAT-O in R vs skat-o in epacts

Alessandro Gialluisi

unread,

Jun 28, 2017, 9:22:27 AM6/28/17

to SKAT and MetaSKAT user group

Hi All, I have been running a skat-o test on a small GWAS (123 cases vs 107 ctrls), using both the original R package and the skat-o command in Epacts, as below

in Epacts:
epacts group -test skat -skat-o -skat-adjust -vcf ../GENO.recode.vcf.gz -ped ../PDsamples_TSI_merged.modifiedPED.ped -groupf GENO.grp -out skat-o/skat-o -cov sex -cov C1 -cov C2 -cov C3 -cov C4 -cov C5 -cov C6 -cov C7 -cov C8 -cov C9 -cov C10 -pheno pheno -missing ./. -field GT --min-maf 0.001 --max-maf 0.1 --min-mac 1 --min-callrate 0.9 -run 1

in R:
phenocov <- Read_Plink_FAM_Cov(paste(dataset2, ".correct.fam", sep=""), paste(dataset2, ".covariates.txt", sep=""), Is.binary=TRUE, flag1=0, cov_header=TRUE)
Generate_SSD_SetID("GENO.b.bed", "GENO.b.bim", "GENO.b.fam", "GENO.setID", File.SSD="GENO.SSD", File.Info="GENO.SSD.Info")
SSD.INFO <- Open_SSD(File.SSD="GENO.SSD", File.Info="GENO.SSD.Info")
nullModelKin <- SKAT_NULL_emmaX(Phenotype ~ sex + C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8 + C9 + C10, data=phenocov, K=NULL, Kin.File="GENO.kinf")
#4-ii# skat-o (with small sample size adjustment)
SKATresults <- SKAT.SSD.All(SSD.INFO, nullModelKin, method="optimal.adj")
SKATresults2 <- SKATresults$results

What is strange is that I get different results (see below top enriched genes, with p-value < 10-4).

in Epacts:
CHROM BEGIN END MARKER_ID NS FRAC_WITH_RARE NUM_ALL_VARS NUM_PASS_VARS NUM_SING_VARS PVALUE STATRHO
16 22376218 22376218 16:22376218-22376218_CDR2 230 0.03913 1 1 0 3.5503e-05 NA
17 40189847 40189847 17:40189847-40189847_ZNF385C 230 0.021739 1 1 0 4.6054e-05 NA
16 4390358 4390358 16:4390358-4390358_CORO7 230 0.14348 1 1 0 7.221e-05 NA
16 4390358 4390358 16:4390358-4390358_PAM16 230 0.14348 1 1 0 8.2825e-05 NA
9 111947761 112082510 9:111947761-112082510_EPB41L4B 230 0.1913 3 2 0 0.00010607 0
8 52232487 52366300 8:52232487-52366300_PXDNL 230 0.3087 11 7 0 0.00014479 0.09
17 41738823 41738823 17:41738823-41738823_MEOX1 230 0.17826 1 1 0 0.00014628 NA
2 103335441 103335646 2:103335441-103335646_MFSD9 230 0.17391 2 2 0 0.00018377 0.5
12 117977550 117977550 12:117977550-117977550_KSR2 230 0.034783 1 1 0 0.00021868 NA
19 38160383 38160760 19:38160383-38160760_ZNF781 230 0.22174 3 3 0 0.00023177 0
15 91449962 91461545 15:91449962-91461545_MAN2A2 230 0.25652 6 4 0 0.00026198 0
1 151254023 151254036 1:151254023-151254036_RP11-126K1.2 230 0.13478 2 1 0 0.00041281 NA
19 5598801 5598801 19:5598801-5598801_SAFB2 230 0.1 1 1 0 0.00053932 NA
13 52439890 52440130 13:52439890-52440130_CCDC70 230 0.026087 2 1 0 0.00056331 NA
17 27943024 27943024 17:27943024-27943024_CORO6 230 0.043478 1 1 0 0.00067776 NA
2 173337495 173344474 2:173337495-173344474_ITGA6 230 0.11304 3 1 0 0.00070369 NA
18 11639335 11639335 18:11639335-11639335_RP11-677O4.1 230 0.15217 1 1 0 0.00090373 NA

in R:
SetID P.value N.Marker.All N.Marker.Test
RNU6-319P 5.3203609018011e-46 1 1
AIP 3.07010287331135e-32 2 2
DDX11L5 5.59063898526521e-12 2 2
XXyac-YRM2039.2 5.59063898526521e-12 2 2
POTEH 1.0850000246997e-08 1 1
ABHD11 2.109200647471e-07 4 4
LINC00035 2.109200647471e-07 4 4
RP11-396K3.1 1.0269648080756e-05 1 1
MIR520E 3.16469441355993e-05 3 3
MIR515-1 3.1695494745354e-05 6 6
MIR519E 3.18533547117861e-05 7 7
MIR520F 3.18533547117861e-05 7 7
PARGP1 4.73016843232088e-05 4 4
TIMM23 4.73016843232088e-05 4 4
ZDHHC19 8.87258318733597e-05 15 15

It seems that in Epacts the skat-o command somehow runs some kind of QC on the variants analyzed, althoug I was not able to find details on this QC in the relevant page. In principle I would rely more on SKAT in R, but this seems to show very inflated p-values in this case, which I did not expect at all given single variant associations obtained in these genes..maybe some of you can tell me whether I did anything wrong? Maybe specifying the weight of variants would help? I used SKAT R package in other GWAS and got results somehow consistent with single variant associations, and not inflated at all..

I am looking forward to hearing from you.

Best,
Alessandro

Seunggeun (Shawn) Lee

unread,

Jul 7, 2017, 10:52:50 AM7/7/17

to SKAT and MetaSKAT user group

Hi Alessandro,

Thanks for investigating it. Hyun Min implemented EMMAX version of SKAT-O somewhat different way, but it is possible that my code has a bug since it hasn't been extensively tested.

Can you provide me some example dataset, so I can investigate why you have inflated type I error rates.

Thanks,
Shawn

Alessandro Gialluisi

unread,

Jul 17, 2017, 12:42:39 PM7/17/17

to SKAT and MetaSKAT user group

Hi Shawn,
I am going to send you the data that I analyzed in a zipped folder in a private email message.

Thanks,
Alessandro

Seunggeun (Shawn) Lee

unread,

Jul 21, 2017, 2:53:59 PM7/21/17

to SKAT and MetaSKAT user group

Hi Alessandro,

I think the main reason for the difference is because you used --max-maf 0.1 in EPACTs so only used variants with MAF < 0.1. If a set has both rare and common variants, since the beta-weighting down-weights common variants, SKAT effectively tests for rare variant effect. However, if a set has common variants only (ex. RNU6-319P and AIP), SKAT tests for the common variant effect. To check this, you can use Get_Genotypes_SSD function to get genotypes and MAFs. In the next version, I will consider to add an option for max-maf.

ex) # Get genotypes of top 10 sets

SKATresults <- SKAT.SSD.All(SSD.INFO, nullModelKin)

ord<-order(SKATresults$results$P.value, decreasing=FALSE)[1:10]

SKATresults$results[ord,]

Z.list<-list()

MAF.list<-list()

for(i in 1:10){

Z.list[[i]]<-Get_Genotypes_SSD(SSD.INFO, ord[i])

MAF.list[[i]]<-colMeans(Z.list[[i]])

}

Seunggeun (Shawn) Lee

unread,

Jul 21, 2017, 3:38:02 PM7/21/17

to SKAT and MetaSKAT user group

BTW, I have implemented this cutoff (max_maf) in version 1.3.2, which can be downloaded from github

library(devtools)

install_github("leeshawn/SKAT")

Example code:

> SKATresults <- SKAT.SSD.All(SSD.INFO, nullModelKin, max_maf=0.1)

|======================================================================| 100%

There were 50 or more warnings (use warnings() to see the first 50)

> ord<-order(SKATresults$results$P.value, decreasing=FALSE)[1:10]

> SKATresults$results[ord,]

SetID P.value N.Marker.All N.Marker.Test

14428 MIR515-1 3.173398e-05 6 3

14458 MIR519E 3.173398e-05 7 3

14464 MIR520F 3.173398e-05 7 3

14463 MIR520E 3.173398e-05 3 2

29756 ZDHHC19 6.036784e-05 15 2

22849 RP11-631M21.6 6.535536e-05 13 7

28705 TUBB8 6.535536e-05 13 7

14429 MIR515-2 9.243062e-05 7 2

14456 MIR519C 9.995662e-05 9 4

12861 LINGO3 3.080339e-04 4 3

Thanks,

Shawn

Alessandro Gialluisi

unread,

Aug 4, 2017, 2:48:04 PM8/4/17

to SKAT and MetaSKAT user group

Hi Shawn, thanks for your answer.

Just to get it clearer:

do you mean that the reason why genes like RNU6-319P and AIP show inflated enrichment stats is that SKAT give extra-weight to common variants in absence of rare variants to test?
in case I want to consider both common and rare variants effect (i.e. without using the max_maf filter), how do you suggest to modify the beta function of the weights as a function of MAF?

I tried the options weights.beta=c(0.05,0.5), weights.beta=c(1,25), weights.beta=c(1,50), but I still get very inflated enrichment stats (with very similar p-values to those obtained above)
I also tried to use SKAT_CommonRare to better balance the weight of common variants, but still get inflated results (see below)

nullModel <- SKAT_Null_Model(Phenotype ~ sex + C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8 + C9 + C10, data=phenocov, out_type="D", Adjustment=T)
SKATresults <- SKAT_CommonRare.SSD.All(SSD.INFO, nullModel, weights.beta.rare=c(1,25), weights.beta.common=c(0.5,0.5),method="C", r.corr.rare=0, r.corr.common=0, CommonRare_Cutoff=0.05, test.type="Joint", is_dosage=FALSE, missing_cutoff=0.15, estimate_MAF=1, SetID1=NULL)

SetID P.value Q N.Marker.All N.Marker.Test N.Marker.Rare N.Marker.Common
RNU6-319P 2.48145485991083e-27 331.487179927506 1 1 0 1
POTEF 6.04419195195118e-20 44.8580057973526 4 4 3 1
PABPC3 1.67683891621373e-18 40.0173844559319 4 4 1 3
AIP 3.04923868032476e-16 120.786322820744 2 2 0 2
RN7SL195P 3.55098384696462e-16 212.524256696052 3 3 0 3
CTB-129P6.4 2.93212454341871e-11 21.6558624800607 13 13 5 8
PVRL2 6.45152800938057e-11 20.0288467037818 17 17 6 11
DDX11L5 8.54113679581071e-11 103.744449738631 2 2 0 2
XXyac-YRM2039.2 8.54113679581071e-11 103.744449738631 2 2 0 2
RSAD2 1.24179330795696e-10 24.5198479109113 3 3 1 2
SNAPC3 1.90667221639699e-10 21.7128377926404 8 8 2 6
CMPK2 7.21901049647807e-09 21.101622871925 3 3 1 2
POTEH 1.23342506320256e-08 42.0512923581711 1 1 0 1
MED17 2.26595463991945e-08 222.795938646162 6 6 0 6
MAML3 2.32726537933986e-08 16.8291778269829 8 8 1 7
PITPNM1 3.79093149165064e-08 18.0900381685523 4 4 1 3
ABHD11 5.661579196307e-07 106.260868047573 4 4 0 4
LINC00035 5.661579196307e-07 106.260868047573 4 4 0 4
AC007192.6 4.85742308135792e-06 93.9353809359453 6 6 0 6
PHF12 4.93347836878327e-06 110.181347919776 6 6 0 6
NBPF1 1.00890878185957e-05 10.7567692390767 7 7 1 6

How do you suggest to proceed with assigning weights?
Of course I would like to assign more weight to rare variants, but in this case it seems to me that common variants have too much weight in determining the enrichment stat, right?

Thanks a lot for your help,
Alessandro

Seunggeun (Shawn) Lee

unread,

Aug 16, 2017, 2:02:37 PM8/16/17

to SKAT and MetaSKAT user group

Hi Alessandro,

Sorry for the late response.

do you mean that the reason why genes like RNU6-319P and AIP show inflated enrichment stats is that SKAT give extra-weight to common variants in absence of rare variants to test?

As you can see, RNU6-319P has only one (common) variants, so without thresholding common variants, SKAT just tests for this common variant. It was the main reason for the discrepancy of the results between SKAT package and EPacts.

in case I want to consider both common and rare variants effect (i.e. without using the max_maf filter), how do you suggest to modify the beta function of the weights as a function of MAF?

I tried the options weights.beta=c(0.05,0.5), weights.beta=c(1,25), weights.beta=c(1,50), but I still get very inflated enrichment stats (with very similar p-values to those obtained above)

I also tried to use SKAT_CommonRare to better balance the weight of common variants, but still get inflated results (see below)

You can use SKAT_CommonRare function to test for both common and rare variants.

BTW, when I have looked at genotypes of RNU6-319P, I have found that nearly all cases have g=0 and nearly all controls have g=2. So there may be an error in genotype matrix (genotype coding error) or there exist strong batch effects. Probably you need to investigate it.

Thanks,

Shawn

Alessandro Gialluisi

unread,

Aug 28, 2017, 4:26:29 AM8/28/17

to SKAT and MetaSKAT user group

Hi Shawn,
thanks for the clarifications.

It may well be that there is a strong batch effect since my controls are from 1000G TSI pop while my cases were sequenced in a different experiment with a totally different platform, and I have encountered similar issues while testing for single variant association, even when using MDS components as covariates.

Indeed, using the same pipeline both for input file preparation and SKAT testing in a different dataset, I did not encounter this issue where I have both cases and ctrls from the same population.

Thanks a lot for these hints,
you were of great help.

Best,
Alessandro

Reply all

Reply to author

Forward