Hello,
When you said you ranked using log(pvalue) * (sign of FC), did you mean -log? The transformation should be a negative log (for example, -log10(p-value) * sign(FC)), so that the most significant up-regulated genes land at the top of the list and the most significant down-regulated genes at the bottom.
Also, I'm not sure what you mean by GSEA "randomly assigning ranks". In GSEA Preranked mode, GSEA uses the user-supplied ranks.
With regard to the issues with p-value calculation: GSEA isn't designed to operate on proteomic datasets and may struggle with the small size of the dataset you're using as input. In Preranked mode, GSEA performs significance testing by gene-set permutation. That is, it generates random gene sets of the same size as the gene set being tested and uses them to estimate how likely a gene set of that size is to be enriched in the data by chance. If your "pathway A" is large relative to your small mass spec gene list (the expected size of an input dataset is 10,000-20,000 genes/features), then GSEA might not be able to compute reasonable nulls.
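As a rough illustration of the signed negative-log metric, here is a minimal Python sketch. The function name `signed_log_p`, the base-10 log, and the example gene names and values are all assumptions made up for illustration; they are not taken from your data or from GSEA itself.

```python
import math

def signed_log_p(p_value, log2_fc):
    """Ranking metric: -log10(p) carrying the sign of the fold change."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    # +1 for up-regulated, -1 for down-regulated, 0 for no change
    sign = (log2_fc > 0) - (log2_fc < 0)
    return -math.log10(p_value) * sign

# Hypothetical rows: (gene, p-value, log2 fold change)
rows = [
    ("GENE_A", 0.001, 2.0),
    ("GENE_B", 0.5, -0.3),
    ("GENE_C", 0.0001, -1.5),
]

# Sort descending so strongly up-regulated genes come first
ranked = sorted(
    ((gene, signed_log_p(p, fc)) for gene, p, fc in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Without the minus sign, the order would be inverted: the most significant up-regulated gene would receive the most negative score and end up at the bottom of the list.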
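To make the gene-set permutation idea concrete, here is a simplified Python sketch. It uses an unweighted, Kolmogorov-Smirnov-like running sum as the enrichment score, which is a stand-in and not GSEA's actual weighted statistic. All function names and the toy 200-gene list are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES (a simplification of GSEA's weighted score)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best

def gene_set_permutation_null(ranked_genes, set_size, n_perm=1000, seed=0):
    """ES values for random gene sets of the same size drawn from the list."""
    rng = random.Random(seed)
    return [
        enrichment_score(ranked_genes, set(rng.sample(ranked_genes, set_size)))
        for _ in range(n_perm)
    ]

def nominal_p(observed, null):
    """Fraction of same-sign null ES values at least as extreme as observed."""
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    if not same_sign:
        return float("nan")
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed))
    return extreme / len(same_sign)

# Toy data: 200 ranked genes, with the gene set occupying the top 20 positions
ranked_genes = [f"G{i}" for i in range(200)]
observed = enrichment_score(ranked_genes, set(ranked_genes[:20]))
null = gene_set_permutation_null(ranked_genes, set_size=20)
p = nominal_p(observed, null)
```

In this sketch the random sets are drawn from the ranked list itself, so a short list or a gene set that covers much of it leaves fewer distinct random sets to draw.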
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego
--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
gsea-help+...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/gsea-help/ab0793d2-abad-47dc-bc5e-d1b82c716ce4n%40googlegroups.com.
The mode of calculating nulls that you're describing is similar to the Phenotype permutation mode of GSEA, which computes the nulls by scrambling your sample labels and computing differential expression between the randomized groups. Because a preranked list has no groups and no per-sample information, this mode of null generation is not available in GSEA Preranked.
62 genes is on the smaller end of gene set sizes (the default minimum is 15). GSEA calculates how overrepresented a collection of genes, taken as a whole, is at one end of the ranked distribution compared to random chance. Unless the gene set is extremely small, or extremely large relative to the size of the ranked list, the size of the expression list matters more than the size of the gene set. 62 genes out of a 2000-member ranked list is a relative size that should be reasonable, so I would bet that the issue with the p-values has to do with the list's ranking.
If a large proportion of those 2000 genes are highly ranked, then the spread of enriched nulls might be skewed. On the enrichment results page for your gene set of interest there will be two plots: the "Enrichment Plot: " and, at the very bottom, the "Random ES Distribution". The Enrichment Plot shows you how the positions of your genes of interest compare to the ranked list as a whole. What you'll want to look for in this plot is a skew in the zero cross, or a "stepping"-like appearance in what should be a smooth distribution. These could indicate issues with your ranked list. In the Random ES Distribution plot, the normal appearance is a relatively tight but smooth bimodal distribution of the random ES values. Deviations from this can indicate issues with generating the null.
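For contrast, label scrambling can be sketched in Python as below. This is only an illustration of the idea: the difference of group means stands in for whatever differential expression metric is actually used, and the group labels "A"/"B", function names, and expression values are all hypothetical.

```python
import random
from statistics import mean

def diff_of_means(values, labels):
    """Stand-in differential expression metric: mean(A) - mean(B)."""
    a = [v for v, lab in zip(values, labels) if lab == "A"]
    b = [v for v, lab in zip(values, labels) if lab == "B"]
    return mean(a) - mean(b)

def phenotype_permutation_metrics(expression, labels, n_perm=100, seed=0):
    """Recompute the per-gene metric after shuffling the sample labels."""
    rng = random.Random(seed)
    shuffled = list(labels)  # copy so the caller's labels stay untouched
    results = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes are preserved by shuffling
        results.append(
            {gene: diff_of_means(vals, shuffled) for gene, vals in expression.items()}
        )
    return results

# Hypothetical per-sample values: three "A" samples, then three "B" samples
expression = {
    "GENE_A": [5.0, 5.1, 4.9, 1.0, 1.1, 0.9],
    "GENE_B": [2.0, 2.2, 1.8, 2.1, 1.9, 2.0],
}
labels = ["A", "A", "A", "B", "B", "B"]

observed = diff_of_means(expression["GENE_A"], labels)
null_metrics = phenotype_permutation_metrics(expression, labels)
```

Every step here needs the per-sample values and the group labels, which is exactly the information a preranked list does not carry.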
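Before looking at the plots, a quick numeric check of the ranked list can hint at the same problems. The Python sketch below reports how the scores split around zero and how many are tied. Treating ties as one possible source of a "stepping" appearance is my assumption, and the function name and example scores are hypothetical.

```python
from collections import Counter

def ranked_list_diagnostics(scores):
    """Summarize sign balance and ties in a list of ranking scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    n_pos = sum(1 for s in scores if s > 0)
    n_neg = sum(1 for s in scores if s < 0)
    counts = Counter(scores)
    # Count every score that shares its exact value with at least one other
    n_tied = sum(c for c in counts.values() if c > 1)
    return {
        "fraction_positive": n_pos / n,
        "fraction_negative": n_neg / n,
        "fraction_tied": n_tied / n,
    }

# Hypothetical scores with two groups of tied values
scores = [3.0, 3.0, 1.5, 0.2, -0.1, -2.0, -2.0, -2.0]
report = ranked_list_diagnostics(scores)
```

A positive fraction far from 0.5 would correspond to a shifted zero cross, and a high tied fraction means many genes have no well-defined order relative to each other.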
If you're willing to send those plots so I can take a look and give you my thoughts, you can find the raw images in GSEA's results directory.