Hi Girogia,
It is normal for a set to contain both up- and down-regulated genes; the sign of the enrichment score reflects which direction predominates among the genes that are members of the set.
The only filtering you would want to do is to remove genes that have ~zero counts across all samples. GSEA never uses a list filtered on the basis of p-value or log2FC.
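As a rough illustration of that filtering step, here is a minimal Python sketch. The gene names, the count values, and the `min_total` parameter are all made up for the example; it only assumes that the counts are held as one list of per-sample values per gene.

```python
def drop_all_zero_genes(counts, min_total=0.0):
    """Keep genes whose counts, summed across all samples, exceed min_total.

    `counts` maps a gene name to its list of per-sample counts.
    `min_total` is a hypothetical knob for treating "~zero" as zero.
    No filtering on p-value or log2FC happens here, on purpose.
    """
    return {gene: values for gene, values in counts.items() if sum(values) > min_total}


# Hypothetical counts for three genes across four samples.
counts = {
    "GENE_A": [12.0, 30.5, 8.1, 15.2],
    "GENE_B": [0.0, 0.0, 0.0, 0.0],   # zero in every sample, so it is dropped
    "GENE_C": [0.0, 2.3, 0.0, 1.1],   # low but expressed, so it is kept
}

filtered = drop_all_zero_genes(counts)
print(sorted(filtered))
```

Everything that survives this step goes to GSEA, whether or not it was called differentially expressed.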
You would not want to use the raw reads that you used as input for DESeq2; instead, you would need to access the normalized counts table that DESeq2 generates internally.
Generally we'd recommend the normal GCT format with gene set permutation if you have between 3 and 6 samples in each of your two groups, and the GCT format with phenotype permutation if you have 7+ samples per group. GSEA Preranked is generally used as a fallback when there is a specific reason that the standard modes aren't working (e.g. you know you want to use an external log2FC calculation for ranking instead of our default signal-to-noise calculation, or you only have N=2 and our internal metrics error out, or you are unable to output the appropriate normalization for your data). One isn't strictly better or worse than the other.
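The rule of thumb above can be written down as a small Python helper. This is only a sketch of the guidance in this message, not anything built into GSEA: the function name is made up, and using the smaller of the two group sizes for unbalanced designs is an assumption on top of what is stated above.

```python
def suggest_gsea_mode(n_group_a, n_group_b):
    """Suggest a GSEA mode from the number of samples in each of two groups.

    Assumption: for unbalanced designs the smaller group decides.
    """
    smallest = min(n_group_a, n_group_b)
    if smallest >= 7:
        return "GCT input, phenotype permutation"
    if smallest >= 3:
        return "GCT input, gene set permutation"
    # Fewer than 3 per group: the standard modes are not a good fit,
    # so fall back to a ranked list.
    return "GSEA Preranked (RNK input)"


for sizes in [(4, 5), (8, 10), (2, 2)]:
    print(sizes, "->", suggest_gsea_mode(*sizes))
```

The other reasons for choosing Preranked mentioned above (an external ranking metric, or normalization that cannot be exported) are judgment calls that a sample-count check like this cannot capture.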
DAVID and GSEA have completely different underlying methodologies and require different inputs, so it is understandable that they give somewhat different results. It is quite common to use different methodologies like this and to report the results side by side. We obviously would have a bias towards the results from GSEA, but hits that are common across multiple methodologies are generally stronger candidates than those that are called by just one.
Let me know if you have more questions or if anything here was unclear.
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego
--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
gsea-help+...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/gsea-help/c53ab3c5-9d11-45bc-9adf-82f7d571117an%40googlegroups.com.
Hi Girogia,
I never said you couldn't! You absolutely can provide an unfiltered log2FC or Stat list in .rnk format as input for GSEA Preranked. That is a perfectly valid way to run GSEA.
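For illustration, here is a minimal Python sketch of writing such a list as a two-column, tab-delimited .rnk file. The gene names and log2FC values are made up, and the `write_rnk` helper is hypothetical; in practice the scores would come from the full, unfiltered DESeq2 results table.

```python
import csv
import io


def write_rnk(scores, handle):
    """Write gene/score pairs as tab-delimited lines, one gene per line.

    `scores` maps gene name to ranking metric (e.g. log2FC or the Stat column).
    Rows are sorted from highest to lowest score so the file is easy to inspect.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        writer.writerow([gene, score])


# Hypothetical, unfiltered ranking values: non-significant genes stay in.
scores = {"GENE_A": 2.5, "GENE_B": -1.75, "GENE_C": 0.1}

buffer = io.StringIO()
write_rnk(scores, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("ranked_genes.rnk", "w", newline="")`; the file name is only an example.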
I was trying to answer your question about what is preferred in general: the GCT format or the RNK format. GCT format using normalized counts is generally the preferred workflow under the circumstances I outlined, but Preranked is perfectly fine assuming the data meets GSEA's "complete dataset" expectations.
My initial answer, which you've quoted, was specifically about what to use for Preranked from the DESeq2 results table.
Let me know if you have more questions.
-Anthony
Hi Girogia,
I'll address these in-line:
1) If I want to do the Preranked analysis with the .rnk file, I have to use an unfiltered list of genes with the respective log2FC value next to each one.
- This is correct.
2) If I want to do the analysis with the .gct file, I have to use the normalized counts. Should the list of genes and counts that I provide to GSEA be filtered beforehand by some cutoff, or must it be unfiltered? For example, if I have a list of 40000 genes with their counts, would I provide all 40000 genes to GSEA, and not just the DEGs that I would have obtained by filtering these 40000 genes on p-value and fold change?
- The gct file should contain normalized counts for all genes that were expressed in the samples. This should NOT be filtered by log2FC or p-value. The only filtering that should be done is to remove genes with zero counts (summed across all samples).
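As a sketch of what that looks like in practice, the Python below writes a small table of normalized counts in the tab-delimited GCT layout: a `#1.2` version line, a line with the number of genes and samples, a header row starting with `NAME` and `Description`, then one row per gene. The gene names, sample names, and values are made up, the `write_gct` helper is hypothetical, and "na" is used as a placeholder description.

```python
import io


def write_gct(counts, sample_names, handle):
    """Write normalized counts to `handle` in GCT (#1.2) layout.

    `counts` maps gene name to a list of per-sample normalized counts,
    in the same order as `sample_names`.
    """
    handle.write("#1.2\n")
    handle.write(f"{len(counts)}\t{len(sample_names)}\n")
    handle.write("\t".join(["NAME", "Description"] + list(sample_names)) + "\n")
    for gene, values in counts.items():
        if len(values) != len(sample_names):
            raise ValueError(f"{gene}: expected {len(sample_names)} values, got {len(values)}")
        handle.write("\t".join([gene, "na"] + [str(v) for v in values]) + "\n")


# Hypothetical normalized counts; all-zero genes are assumed to be removed already.
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
normalized = {
    "GENE_A": [12.0, 30.5, 8.1, 15.2],
    "GENE_C": [0.0, 2.3, 0.0, 1.1],
}

buffer = io.StringIO()
write_gct(normalized, samples, buffer)
print(buffer.getvalue(), end="")
```

Note that every expressed gene is written out; nothing here looks at p-values or fold changes.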