Data suitability for SKAT


Aurina Arnatkeviciute

May 29, 2024, 9:50:34 PM
to SKAT and MetaSKAT user group
Hello, 

I would appreciate feedback on sequencing data I am analysing with SKAT. After a bioinformatician processed the data, I was given a .vcf file and a tab-delimited VEP annotation file containing the variants that PASS all GATK filters.

The dataset contains 1,194 subjects and 1,069,064 annotated variants (288,547 of them with MAF < 0.01). I then filtered the variants by annotation, resulting in:
  • 84,474 deleterious missense variants (SIFT != "tolerated", PolyPhen > 0.908), of which 5,456 have MAF < 0.01;
  • 31,611 loss-of-function variants (stop_gained, frameshift_variant, stop_lost, start_lost, splice_acceptor_variant, splice_donor_variant), of which 1,178 have MAF < 0.01.
When I create SNP sets for all available genes (combining loss-of-function and deleterious missense variants), the number of variants per gene ranges from 1 to 50, with a median of 5.
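
To make the filtering logic concrete, here is a minimal Python sketch of the selection described above. It is not the code that was actually run: the record fields (`id`, `gene`, `consequence`, `sift`, `polyphen`, `maf`) are hypothetical names for values parsed from the VEP table, and combining the SIFT and PolyPhen conditions with AND is an assumption.

```python
from collections import Counter
from statistics import median

# Consequence terms treated as loss of function (as listed in this post).
LOF_TERMS = {
    "stop_gained", "frameshift_variant", "stop_lost",
    "start_lost", "splice_acceptor_variant", "splice_donor_variant",
}

def consequence_terms(consequence):
    """Split a consequence string that may join several terms with '&' or ','."""
    return set(consequence.replace("&", ",").split(","))

def select_variants(records, maf_cutoff=0.01, polyphen_cutoff=0.908):
    """Keep rare variants that are loss of function or deleterious missense."""
    selected = []
    for rec in records:
        if rec["maf"] >= maf_cutoff:
            continue  # not rare
        terms = consequence_terms(rec["consequence"])
        is_lof = bool(terms & LOF_TERMS)
        # Assumption: both the SIFT and the PolyPhen condition must hold.
        is_del_missense = (
            "missense_variant" in terms
            and rec["sift"] is not None
            and rec["sift"] != "tolerated"
            and rec["polyphen"] is not None
            and rec["polyphen"] > polyphen_cutoff
        )
        if is_lof or is_del_missense:
            selected.append(rec)
    return selected

def variants_per_gene(selected):
    """Count the selected variants in each gene (one SNP set per gene)."""
    return Counter(rec["gene"] for rec in selected)

if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    demo = [
        {"id": "v1", "gene": "G1", "consequence": "missense_variant",
         "sift": "deleterious", "polyphen": 0.95, "maf": 0.001},
        {"id": "v2", "gene": "G1", "consequence": "stop_gained",
         "sift": None, "polyphen": None, "maf": 0.005},
        {"id": "v3", "gene": "G2", "consequence": "synonymous_variant",
         "sift": None, "polyphen": None, "maf": 0.001},
    ]
    counts = variants_per_gene(select_variants(demo))
    print(dict(counts), "median per gene:", median(counts.values()))
```

Running the same counts on the real annotation table would show how many genes end up with only one or two rare variants in their set.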

It's my first time working with this type of data, but ~1 million variants seems relatively low for an exome sequencing dataset. My concern is that the sets of deleterious missense and loss-of-function variants will not contain enough rare variants for association testing. For example, when I run SKAT-O on a larger SNP set, such as the variants in genes associated with a disorder (ADHD in this case), the set contains 475 variants, of which only 370 are used for association testing (the rest show no variation in the sample).
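
As a rough illustration of what "show no variation in the sample" means, the following Python sketch flags monomorphic variants in a genotype matrix. The layout is an assumption (one row per subject, one column per variant, entries coded 0/1/2 alternate-allele counts, with `None` for a missing call), and SKAT's own handling of such variants may differ in detail.

```python
def variant_summary(genotypes):
    """Summarise each variant (column) of a subject-by-variant genotype matrix.

    genotypes: list of per-subject rows; each entry is 0, 1, 2 or None (missing).
    Returns one dict per variant with its alternate-allele frequency and
    whether more than one distinct genotype is observed.
    """
    n_variants = len(genotypes[0]) if genotypes else 0
    summary = []
    for j in range(n_variants):
        calls = [row[j] for row in genotypes if row[j] is not None]
        # Each non-missing call contributes two alleles.
        alt_freq = sum(calls) / (2 * len(calls)) if calls else float("nan")
        summary.append({
            "index": j,
            "alt_freq": alt_freq,
            "polymorphic": len(set(calls)) > 1,
        })
    return summary

def count_polymorphic(genotypes):
    """Number of variants with at least two distinct observed genotypes."""
    return sum(s["polymorphic"] for s in variant_summary(genotypes))

if __name__ == "__main__":
    # Made-up toy matrix: 3 subjects x 4 variants.
    toy = [
        [0, 1, 2, None],
        [0, 0, 2, None],
        [0, None, 2, 0],
    ]
    print(count_polymorphic(toy), "of", len(toy[0]), "variants vary in the sample")
```

Applied to the 475-variant set, a check like this would help show whether the variants that are not used are simply monomorphic among the 1,194 subjects, or whether something earlier in processing is involved.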

Could anyone please comment on the suitability of these data for analysis with SKAT-O? Does this dataset look OK and, if not, which processing step should be investigated?

Thank you very much! 

Kind regards, 
Aurina
