Hi Shawn, I’m hoping you can help me with some difficulty I’m having running SKAT. I have 90 cases and 168 controls for which I am trying to do a case-control analysis. Both with and without covariates run, after the first 80 genes for the non-covariate-applied run (the first 79 were variable p-values), all the p-values are the same.
This is the R code I used:
> FAM<-Read_Plink_FAM(“File.Fam”, Is.binary=TRUE)
> y<-FAM$Phenotype
> SSD.INFO<-Open_SSD(“File.SSD”, “File.Info”)
258 Samples, 14003 Sets, 46856 Total SNPs
Open the SSD file
> obj<-SKAT_Null_Model(y ~ 1, out_type="D")
Sample size (non-missing y and X) = 258, which is < 2000. The small sample adjustment is applied!
> out<-SKAT.SSD.All(SSD.INFO, obj)
There were 40 warnings (use warnings() to see them)
> warnings()
1: The missing genotype rate is 0.011628. Imputation is applied.
2: The missing genotype rate is 0.008721. Imputation is applied.
3: The missing genotype rate is 0.000969. Imputation is applied.
4: The missing genotype rate is 0.009044. Imputation is applied.
……..
> out
…
80 GENE_80 0.807646672535566 1 1
81 GENE_81 0.807646672535566 4 4
82 GENE_82 0.807646672535566 1 1
83 GENE_83 0.807646672535566 5 5
84 GENE_84 0.807646672535566 4 4
85 GENE_85 0.807646672535566 3 3
…
I then attempted to include covariates (headers in Cov.Cov file are SequencingMethod, Sex.y, and Phenotype.y):
> File.Cov<-"./Example1.Cov"
> FAM_Cov<-Read_Plink_FAM_Cov(“File.Fam”, “File.Cov”, Is.binary=TRUE)
> X1 = FAM_Cov$SequencingMethod
> X2 = FAM_Cov$Sex.y
> y<-FAM_Cov$Phenotype.y
> obj<-SKAT_Null_Model(y ~ X1 + X2, out_type="D")
Sample size (non-missing y and X) = 258, which is < 2000. The small sample adjustment is applied!
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: algorithm did not converge
> out<-SKAT.SSD.All(SSD.INFO, obj)
There were 40 warnings (use warnings() to see them)
> warnings()
1: The missing genotype rate is 0.011628. Imputation is applied.
2: The missing genotype rate is 0.008721. Imputation is applied.
3: The missing genotype rate is 0.000969. Imputation is applied.
4: The missing genotype rate is 0.009044. Imputation is applied.
……..
> out
1 GENE_1 0.499413140112253 3 3
2 GENE_2 0.499413140112253 1 1
3 GENE_3 0.499413140112253 2 2
4 GENE_4 0.499413140112253 1 1
5 GENE_5 0.499413140112253 2 2
6 GENE_6 0.499413140112253 4 4
In both of these codes, did I make a mistake somewhere, perhaps missing commands? What could be the reason that the p-values for these genes (both with and without covariates) are almost all identical or are these results correct? I have not incorporated custom weights yet, but it doesn’t seem very likely that the neutral and benign SNPs in most of these genes dilute out the deleterious ones in such a way that the p-values of them are all the same.
Lastly, is there a way to use custom weights without using PP2 scores? It seems
like most of my variants are unable to be predicted by PP2, resulting in a
discrepancy between the SNP set in the custom weights file and those in the
SETID data file.
Thanks in advance,
Emily
Hi Shawn,
Thank you for your reply, I really appreciate it.
I have a very small cohort, 21 cases + 21 controls and have filtered my variant file to only use MAF<0.05 variants for the SKAT test. I do have more controls (>300) but I figured it was best to keep the groups even in number, do you agree?
I do use SKATBinary function and used the following R code (after successfully creating a SSD file):
> FAM<-Read_Plink_FAM(File.fam, Is.binary = TRUE)
> y<-FAM$Phenotype
> SSD.INFO<-Open_SSD(File.SSD, File.Info)
42 Samples, 3590 Sets, 9053 Total SNPs
Open the SSD file
> obj<-SKAT_Null_Model(y ~ 1, out_type = "D")
Sample size (non-missing y and X) = 42, which is < 2000. The small sample adjustment is applied!
> out<-SKATBinary.SSD.All(SSD.INFO, obj, method="optimal.adj")
3500 / 3590 were done
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: 1 SNPs with either high missing rates or no-variation are excluded!
2: 1 SNPs with either high missing rates or no-variation are excluded!
3: The missing genotype rate is 0.077381. Imputation is applied.
4: The missing genotype rate is 0.077381. Imputation is applied.
5: 1 SNPs with either high missing rates or no-variation are excluded!
6: The missing genotype rate is 0.083333. Imputation is applied.
7: 1 SNPs with either high missing rates or no-variation are excluded!
8: The missing genotype rate is 0.083333. Imputation is applied.
9: The missing genotype rate is 0.142857. Imputation is applied.
10: The missing genotype rate is 0.142857. Imputation is applied.
11: The missing genotype rate is 0.047619. Imputation is applied.
12: The missing genotype rate is 0.047619. Imputation is applied.
13: The missing genotype rate is 0.023810. Imputation is applied.
14: The missing genotype rate is 0.023810. Imputation is applied.
15: The missing genotype rate is 0.020408. Imputation is applied.
16: The missing genotype rate is 0.020408. Imputation is applied.
17: The missing genotype rate is 0.018519. Imputation is applied.
18: The missing genotype rate is 0.018519. Imputation is applied.
19: The missing genotype rate is 0.007937. Imputation is applied.
20: The missing genotype rate is 0.007937. Imputation is applied.
21: The missing genotype rate is 0.017857. Imputation is applied.
22: The missing genotype rate is 0.017857. Imputation is applied.
23: Rank of the genotype matrix is one! SKAT is used instead of SKAT-O!
24: Rank of the genotype matrix is one! SKAT is used instead of SKAT-O!
25: Rank of the genotype matrix is one! SKAT is used instead of SKAT-O!
26…………………………………………………………….
> out
$results
SetID P.value N.Marker.All N.Marker.Test MAC m Method.bin
1 A2M 0.50054856 4 3 13 10 ER
2 A2ML1 0.72624114 4 4 21 14 ER.A
3 AAAS 0.86866792 2 2 9 5 ER
4 AAK1 0.05613047 3 2 22 11 ER.A
5 AAMP 0.35647280 1 1 8 4 ER
6 AARSD1 0.11585366 1 1 5 3 ER
7 AATK 0.11585366 1 1 3 3 ER
8 ABCA2 0.19125168 7 7 68 23 MA
9 ABCA7 0.09235855 9 9 51 26 MA
10 ABCB1 0.60422899 3 3 7 6 ER
11 ABCB9 0.29409006 1 1 5 4 ER
12 ABCC8 0.23056135 4 4 19 14 ER
13 ABCD1 0.54690432 2 2 12 4 ER
14 ABCD2 0.54690432 3 3 18 4 ER
15 ABCF1 0.54690432 1 1 6 4 ER
16 ABHD1 0.54690432 2 2 12 4 ER
17 ABHD16A 0.54690432 2 2 12 4 ER
…………………………for all other genes in the list, I get the bolded/underlined p-value (or 3 other very similar p-values). MAC range from 6 to >100.
I realize my data is terribly underpowered but I am trying to do what I can with it and SKAT-O with small sample adjustment seemed like a good option for gene-based tests. I didn´t expect fantastic p-values but got worried when all genes with “variable” p-values start with an “A” (!). I am also planning to add resampling, but I wanted to understand what´s is going on with my first run.
Thanks a lot!
Mia
1) Rare variant analysis: convert VCF files from above to .bed/.bim files (which are condensed .ped/.map using PLINK 1.90) and run SKAT (gene-based analysis, e.g. setID from SNPEff annotations) for rare variants only (see 2.7 of SKAT.pdf) and combined common/rare variants (see 2.4 of SKAT.pdf). Show gene-based analysis results ranked by p-value.
Now let me explain what I did:
a) I downloaded the SKAT package in R and am doing in the Moba Xterm terminal. Also, I downloaded the example_bped file and unzipped to the desktop.
Am not sure of how to import the file in R working directory and in SKAT package most of the examples files are present and when I am trying to Generate_SSD_SetID it throws an error :
installing to /global/home/sa103020/R/x86_64-pc-linux-gnu-library/3.4/SKAT/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (SKAT)
The downloaded source packages are in
‘/tmp/Rtmp7QGlkn/downloaded_packages’
>
> File.Bed<-"./Example1.bed"
> File.Bim<-"./Example1.bim"
> File.Fam<-"./Example1.fam"
> File.SetID<-"./Example1.SetID"
> File.Info<-"./Example1.SSD.info"
> Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
Error in Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, :
could not find function "Generate_SSD_SetID".
I couldn't figure it out and any help would be appreciable and have to submit it tomorrow.
In the tutorial part where the original code has taken from SKAT.PDF(2.4 and 2.7) has to be done. Please find the attached of SKAT pdf!!
Thanks for answering questions in this forum, it has been very helpful to browse around and also read your responses to problems other users have.
When performing SKAT-O some time ago, I got a nice hit on one gene where cases and controls had different number of rare SNPs. Now I went back to evaluate and summarize the SKAT-O result and I now have some concerns.
In total, this gene has 16 exonic SNPs of which only 3 occur in cases. My sample set is small and unbalanced (22 cases + 350 controls) and you previously advised me to use all samples together and not try to balance the number of cases and controls (to improve the total number of samples). My concern now is that 10 of the 16 SNPs in controls are singletons only found in one individual each. Would it be better for me to manually evaluate my top genes and look for those where cases appear to have more SNPs than controls? The signal from my top gene only seem to reflect that much higher number of controls in my sample set, or what do you think?
Many thanks for your input.
Mia