new to SKAT

1,520 views
Skip to first unread message

Emily Chen

unread,
Jan 14, 2015, 4:54:11 PM1/14/15
to SKAT...@googlegroups.com
Hi, I am trying to run a case-control analysis using SKAT-O with covariates and have a couple of questions. I have the .Bed, .Bim, .Fam, and .SetID created.
 
1) I have 90 cases and 168 controls to run together. I have a few conditions I would like to set for this case-control analysis: 
  -I want to test for association between a set of SNPs/genes and a dichotomous phenotype (case or control)
  -I want to test this on both common and rare variants
  -I would like either FDR or Bonferroni corrected p-values that are based on a gene-level test
  -I want to include covariates (e.g. sequencing method, gender)
What would be the best Usage to use on an analysis of this sort that includes those 4 points? Also, how would I import covariates into the command? Could you give an example of what a covariate file looks like?
 
2) Not certain of where to begin, I started with the Usage Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info) with all of the file names each in quotations. I understood it to be that File.SSD and File.Info do not exist and are the files that will be created by this command. When I run the Usage, I am getting the following error:
Error in '[.data.frame'(SSD.Info, , 2) : undefined columns selected
How do I remedy this error?
 
Thank you in advance,
Emily

Seunggeun (Shawn) Lee

unread,
Jan 21, 2015, 4:34:04 PM1/21/15
to SKAT...@googlegroups.com
Hi Emily,

1) 
- I think you can use SKAT_CommonRare functions to test for both common and rare variants. When you test for rare variants only, please use the newly developed SKATBinary functions. 
- You can find an example covariate file in the example data set (files with .cov or .Cov). I think the package vignettes have an example code. 
- For bonferroni correction, you can use alpha=0.05/K, where K is the number of gene to test. This correction is based on an assumption that there is no correlation (so no LD) between genes, but I think it works reasonably well. 

2) I am not sure why you get this error message. Have you used the example dataset? Did you get the same error? If you get the same error, can you send me your R-code, so I can figure it out why you had this error?

Thanks,
Shawn

Message has been deleted

Emily Chen

unread,
Jan 22, 2015, 2:19:56 PM1/22/15
to SKAT...@googlegroups.com

Hi Shawn, I’m hoping you can help me with some difficulty I’m having running SKAT. I have 90 cases and 168 controls for which I am trying to do a case-control analysis. Both with and without covariates run, after the first 80 genes for the non-covariate-applied run  (the first 79 were variable p-values), all the p-values are the same.

 

This is the R code I used:

> FAM<-Read_Plink_FAM(“File.Fam”, Is.binary=TRUE)

> y<-FAM$Phenotype

> SSD.INFO<-Open_SSD(“File.SSD”, “File.Info”)

 258 Samples, 14003 Sets, 46856 Total SNPs

Open the SSD file

> obj<-SKAT_Null_Model(y ~ 1, out_type="D")

Sample size (non-missing y and X) = 258, which is < 2000. The small sample adjustment is applied!

> out<-SKAT.SSD.All(SSD.INFO, obj)

There were 40 warnings (use warnings() to see them)

> warnings()

1: The missing genotype rate is 0.011628. Imputation is applied.

2: The missing genotype rate is 0.008721. Imputation is applied.

3: The missing genotype rate is 0.000969. Imputation is applied.

4: The missing genotype rate is 0.009044. Imputation is applied.

……..

> out

80 GENE_80       0.807646672535566  1 1

81 GENE_81       0.807646672535566  4 4

82 GENE_82       0.807646672535566  1 1

83 GENE_83       0.807646672535566  5 5

84 GENE_84       0.807646672535566  4 4

85 GENE_85       0.807646672535566  3 3

 

I then attempted to include covariates (headers in Cov.Cov file are SequencingMethod, Sex.y, and Phenotype.y):

> File.Cov<-"./Example1.Cov"

> FAM_Cov<-Read_Plink_FAM_Cov(“File.Fam”, “File.Cov”, Is.binary=TRUE)

> X1 = FAM_Cov$SequencingMethod

> X2 = FAM_Cov$Sex.y

> y<-FAM_Cov$Phenotype.y

> obj<-SKAT_Null_Model(y ~ X1 + X2, out_type="D")

Sample size (non-missing y and X) = 258, which is < 2000. The small sample adjustment is applied!

Warning messages:

1: glm.fit: algorithm did not converge

2: glm.fit: algorithm did not converge

> out<-SKAT.SSD.All(SSD.INFO, obj)

There were 40 warnings (use warnings() to see them)

> warnings()

1: The missing genotype rate is 0.011628. Imputation is applied.

2: The missing genotype rate is 0.008721. Imputation is applied.

3: The missing genotype rate is 0.000969. Imputation is applied.

4: The missing genotype rate is 0.009044. Imputation is applied.

……..

> out

1 GENE_1       0.499413140112253  3 3

2 GENE_2       0.499413140112253  1 1

3 GENE_3       0.499413140112253  2 2

4 GENE_4       0.499413140112253  1 1

5 GENE_5       0.499413140112253  2 2

6 GENE_6       0.499413140112253  4 4

 

In both of these codes, did I make a mistake somewhere, perhaps missing commands? What could be the reason that the p-values for these genes (both with and without covariates) are almost all identical or are these results correct? I have not incorporated custom weights yet, but it doesn’t seem very likely that the neutral and benign SNPs in most of these genes  dilute out the deleterious ones in such a way that the p-values of them are all the same.


Lastly, is there a way to use custom weights without using PP2 scores? It seems like most of my variants are unable to be predicted by PP2, resulting in a discrepancy between the SNP set in the custom weights file and those in the SETID data file.

 

Thanks in advance,

Emily

Seunggeun (Shawn) Lee

unread,
Jan 30, 2015, 4:27:09 PM1/30/15
to SKAT...@googlegroups.com
Hi Emily,

It doesn't seem that there is any error in the code. Is it possible that you generate some example dataset and send it to me, so I can check why this error happens? 

Thanks,
Shawn

alanita

unread,
Apr 24, 2015, 3:16:27 AM4/24/15
to SKAT...@googlegroups.com
Hi
I hve more or less the same situation as Emily (sample size 200), an  cae control study. I was just wondering which are the optimal value for some parameters. I read in the SKat paper that in small samples it should be used a specific estimador for the  the variance.
I was just wondering which are the best parameters to realize each test.. For example should i use a specific kernel?
Thank you in advance

Seunggeun (Shawn) Lee

unread,
May 7, 2015, 9:38:15 AM5/7/15
to SKAT...@googlegroups.com
Hi Alanita,

1) For small sample case-control studies, you can use SKATBinary function to compute p-values of SKAT and SKAT-O. 
2) If you don't have any prior information on which form of tests (ex SKAT or Burden tests) would be better, you can use SKAT-O, which combines SKAT and burden test approaches. 
3) SKAT and SKAT-O use beta weights (1,25) to upweight rarer variants. If you want to mainly test for rare variant associations, you can use this weight. 

Thanks,
Shawn


Mia

unread,
May 1, 2017, 3:37:44 AM5/1/17
to SKAT and MetaSKAT user group
Hi Shawn and Emily, 

I realize this was >2 years ago but I have the same problem as Emily- getting mostly identical  p-valus when running SKAT-0 (without co-variates). Can you recall how you solved this problem, or if anything was wrong with the input files?

Many Thanks

Mia 

Seunggeun (Shawn) Lee

unread,
May 5, 2017, 10:31:09 AM5/5/17
to SKAT and MetaSKAT user group
Hi Mia,

1) Have you used SKATBinary function? It works better for binary traits. 

2) Even if you used SKATBinary, if the total minor allele count (MAC) is very small, it is possible that p-values are similar. But before seeing example or data, it is difficult to say what causes this problem. Can you provide me more details on your data (ex. sample size, case-control ratio,  outcomes from SKATBinary, etc)

Thanks,
Shawn

Mia

unread,
May 10, 2017, 1:44:33 AM5/10/17
to SKAT and MetaSKAT user group

Hi Shawn,

 

Thank you for your reply, I really appreciate it.

 

I have a very small cohort, 21 cases + 21 controls and have filtered my variant file to only use MAF<0.05 variants for the SKAT test. I do have more controls (>300) but I figured it was best to keep the groups even in number, do you agree?  

 

I do use SKATBinary function and used the following R code (after successfully creating a SSD file):

 

> FAM<-Read_Plink_FAM(File.fam, Is.binary = TRUE)

> y<-FAM$Phenotype

> SSD.INFO<-Open_SSD(File.SSD, File.Info)

42 Samples, 3590 Sets, 9053 Total SNPs

Open the SSD file

> obj<-SKAT_Null_Model(y ~ 1, out_type = "D")

Sample size (non-missing y and X) = 42, which is < 2000. The small sample adjustment is applied!

> out<-SKATBinary.SSD.All(SSD.INFO, obj, method="optimal.adj")

 3500 / 3590 were done

There were 50 or more warnings (use warnings() to see the first 50)

 

> warnings()

Warning messages:

1: 1 SNPs with either high missing rates or no-variation are excluded!

2: 1 SNPs with either high missing rates or no-variation are excluded!

3: The missing genotype rate is 0.077381. Imputation is applied.

4: The missing genotype rate is 0.077381. Imputation is applied.

5: 1 SNPs with either high missing rates or no-variation are excluded!

6: The missing genotype rate is 0.083333. Imputation is applied.

7: 1 SNPs with either high missing rates or no-variation are excluded!

8: The missing genotype rate is 0.083333. Imputation is applied.

9: The missing genotype rate is 0.142857. Imputation is applied.

10: The missing genotype rate is 0.142857. Imputation is applied.

11: The missing genotype rate is 0.047619. Imputation is applied.

12: The missing genotype rate is 0.047619. Imputation is applied.

13: The missing genotype rate is 0.023810. Imputation is applied.

14: The missing genotype rate is 0.023810. Imputation is applied.

15: The missing genotype rate is 0.020408. Imputation is applied.

16: The missing genotype rate is 0.020408. Imputation is applied.

17: The missing genotype rate is 0.018519. Imputation is applied.

18: The missing genotype rate is 0.018519. Imputation is applied.

19: The missing genotype rate is 0.007937. Imputation is applied.

20: The missing genotype rate is 0.007937. Imputation is applied.

21: The missing genotype rate is 0.017857. Imputation is applied.

22: The missing genotype rate is 0.017857. Imputation is applied.

23: Rank of the genotype matrix is one! SKAT is used instead of SKAT-O!

24: Rank of the genotype matrix is one! SKAT is used instead of SKAT-O!

25: Rank of the genotype matrix is one! SKAT is used instead of SKAT-O!

26…………………………………………………………….

 

> out

$results

               SetID    P.value N.Marker.All N.Marker.Test MAC  m Method.bin

1                A2M 0.50054856            4             3  13 10         ER

2              A2ML1 0.72624114            4             4  21 14       ER.A

3               AAAS 0.86866792            2             2   9  5         ER

4               AAK1 0.05613047            3             2  22 11       ER.A

5               AAMP 0.35647280            1             1   8  4         ER

6             AARSD1 0.11585366            1             1   5  3         ER

7               AATK 0.11585366            1             1   3  3         ER

8              ABCA2 0.19125168            7             7  68 23         MA

9              ABCA7 0.09235855            9             9  51 26         MA

10             ABCB1 0.60422899            3             3   7  6         ER

11             ABCB9 0.29409006            1             1   5  4         ER

12             ABCC8 0.23056135            4             4  19 14         ER

13             ABCD1 0.54690432            2             2  12  4         ER

14             ABCD2 0.54690432            3             3  18  4         ER

15             ABCF1 0.54690432            1             1   6  4         ER

16             ABHD1 0.54690432            2             2  12  4         ER

17           ABHD16A 0.54690432            2             2  12  4         ER

 

…………………………for all other genes in the list, I get the bolded/underlined p-value (or 3 other very similar p-values). MAC range from 6 to >100.  

 

I realize my data is terribly underpowered but I am trying to do what I can with it and SKAT-O with small sample adjustment seemed like a good option for gene-based tests. I didn´t expect fantastic p-values but got worried when all genes with “variable” p-values start with an “A” (!). I am also planning to add resampling, but I wanted to understand what´s is going on with my first run.  

 

Thanks a lot!

Mia

 

 

Seunggeun (Shawn) Lee

unread,
May 12, 2017, 4:46:27 PM5/12/17
to SKAT and MetaSKAT user group
Hi Mia, 

1) I have a very small cohort, 21 cases + 21 controls and have filtered my variant file to only use MAF<0.05 variants for the SKAT test. I do have more controls (>300) but I figured it was best to keep the groups even in number, do you agree?  

Using more controls can improve power, so I recommend using all of them. 

2) Same p-values 
Since your sample size is small and the number of cases and controls are the same, many genes will have the same p-value. I think the problem will be alleviated in some degree if you use all the controls (but some genes will still have the same p-values)

Thanks,
Shawn

Trish

unread,
Sep 19, 2017, 5:51:08 PM9/19/17
to SKAT and MetaSKAT user group
hi Shawn ~ 
I am not actually replying to Emily ~ sorry. I am new to this Google Groups format and don't see where to start a new thread within the New to SKAT page. 

My question is this: I have Illumina-output data, along with multi-variable phenotype data. I am not seeing clear instructions on how to format data for use with SKAT. Do I need to format the SNP data as I see in the genipe example here? 


I would also like to use a moving-window method, to begin. Any assistance you can give me in getting started would be greatly appreciated. 
Thank you! ~
Trish 

Seunggeun (Shawn) Lee

unread,
Sep 20, 2017, 10:04:23 AM9/20/17
to SKAT and MetaSKAT user group
Hi Trish,

Frankly, I haven't used genipe, so I don't know how it works. 
If you can convert your data to plink bed file, you can use SKAT R-package. Please take a look at the vignette page 7  (https://cran.r-project.org/web/packages/SKAT/vignettes/SKAT.pdf)

Thanks,
Shawn

ganga...@gmail.com

unread,
Feb 27, 2018, 1:44:53 AM2/27/18
to SKAT and MetaSKAT user group
Hi,

Can someone help me with SKAT tutorial and I have zero clue why I am getting this error.

This is the part I am supposed to do

1) Rare variant analysis: convert VCF files from above to .bed/.bim files (which are condensed .ped/.map using PLINK 1.90) and run SKAT (gene-based analysis, e.g. setID from SNPEff annotations) for rare variants only (see 2.7 of SKAT.pdf) and combined common/rare variants (see 2.4 of SKAT.pdf). Show gene-based analysis results ranked by p-value.


Now let me explain what I did:


a) I downloaded the SKAT package in R  and am doing in the Moba Xterm terminal. Also, I downloaded the example_bped file and unzipped to the desktop. 

Am not sure of how to import the file in R working directory and in SKAT package most of the examples files are present and when I am trying to Generate_SSD_SetID it throws an error :


installing to /global/home/sa103020/R/x86_64-pc-linux-gnu-library/3.4/SKAT/libs

** R

** data

** inst

** preparing package for lazy loading

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

* DONE (SKAT)


The downloaded source packages are in

        ‘/tmp/Rtmp7QGlkn/downloaded_packages’

>

> File.Bed<-"./Example1.bed"

> File.Bim<-"./Example1.bim"

> File.Fam<-"./Example1.fam"

> File.SetID<-"./Example1.SetID"

> File.Info<-"./Example1.SSD.info"

> Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)

Error in Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID,  :

  could not find function "Generate_SSD_SetID".

I couldn't figure it out and any help would be appreciable and have to submit it tomorrow.


In the tutorial part where the original code has taken from SKAT.PDF(2.4 and 2.7) has to be done. Please find the attached of SKAT pdf!!



.7 Plink Binary format le\
SKAT package can use plink binary format les for genome-wide data analysis.
To use plink les, plink bed, bim and fam les, and your own setid le that
contains information of SNP sets are needed. Example les can be found on the
SKAT/MetaSKAT google group page.
> # To run this code, first download and unzip example files
>
> ##############################################
> # Generate SSD file
>
> # Create the MW File
> File.Bed<-"./Example1.bed"
> File.Bim<-"./Example1.bim"
> File.Fam<-"./Example1.fam"
> File.SetID<-"./Example1.SetID"
> File.SSD<-"./Example1.SSD"
> File.Info<-"./Example1.SSD.info"
> # To use binary ped files, you have to generate SSD file first.
> # If you already have a SSD file, you do not need to call this function.
> Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
Check duplicated SNPs in each SNP se



SKAT.pdf

Seunggeun (Shawn) Lee

unread,
Mar 8, 2018, 9:18:04 AM3/8/18
to SKAT and MetaSKAT user group
Hi 

Did you load SKAT library using library function? It seems like SKAT package is not loaded. 

Thanks,
Shawn

Mia

unread,
Oct 30, 2018, 9:43:20 AM10/30/18
to SKAT and MetaSKAT user group

Hello Shawn, 

 

Thanks for answering questions in this forum, it has been very helpful to browse around and also read your responses to problems other users have. 

 

When performing SKAT-O some time ago, I got a nice hit on one gene where cases and controls had different number of rare SNPs. Now I went back to evaluate and summarize the SKAT-O result and I now have some concerns. 

 

In total, this gene has 16 exonic SNPs of which only 3 occur in cases. My sample set is small and unbalanced (22 cases + 350 controls) and you previously advised me to use all samples together and not try to balance the number of cases and controls (to improve the total number of samples). My concern now is that 10 of the 16 SNPs in controls are singletons only found in one individual each. Would it be better for me to manually evaluate my top genes and look for those where cases appear to have more SNPs than controls? The signal from my top gene only seem to reflect that much higher number of controls in my sample set, or what do you think? 

 

Many thanks for your input. 

 

Mia

Seunggeun (Shawn) Lee

unread,
Nov 5, 2018, 1:23:54 PM11/5/18
to SKAT and MetaSKAT user group
Hi Mia,

Did you use SKATBinary? For this case-control imbalance, you need to use this function to get a more accurate p-value. 

I think it will be good to manually evaluate whether p-values of your top genes are reasonable. 
Also, you can check the burden of rare variants among cases and controls to see whether the results are reasonable. 

Thanks,
Shawn

Mia

unread,
Nov 8, 2018, 8:38:08 AM11/8/18
to SKAT and MetaSKAT user group
Thanks Shawn- it sound like we agree that I need to check my top genes one by one, and ignore those with many singletons in controls. Yes, I used SKATBinary and the p-values are not overwhelming (not significant after Bonferroni correction) but I would like to report the result anyway. Cheers, Mia
Reply all
Reply to author
Forward
0 new messages