GSEA input file

Giorgia Silvestrini

Jan 4, 2022, 4:19:44 PM

to gsea-help

Hi Anthony,
I redid the pre-ranked analysis using the unfiltered data, as you suggested. However, within a pathway with a positive ES, for example, I still get genes with both positive and negative log2FC values.
If I instead run the normal analysis with a .gct file containing the reads from the DESeq2 file, can I use the list of DEGs filtered by my cutoffs (adjusted p-value and log2FC), or should the list be unfiltered in this case too?
Which of the two analyses is better: pre-ranked or the normal .gct one?
I wanted to run the enrichment analysis with both DAVID and GSEA to get stronger evidence, so I thought it was correct to give both tools the same list of DEGs selected with my cutoffs. If I give DAVID the DEGs and GSEA an unfiltered list of genes, is that a problem?
Thanks a lot,
Giorgia

Anthony Castanza

Jan 4, 2022, 4:53:21 PM

to gsea...@googlegroups.com

Hi Giorgia,

It is normal for a set to contain both up- and down-regulated genes. The sign of the enrichment score reflects the predominant direction of enrichment among the genes that are members of the set.
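To make this concrete, here is a minimal Python sketch of a weighted running-sum enrichment score of the kind GSEA is based on. It is an illustration only, not the GSEA implementation: the function name `enrichment_score`, the gene names, and the scores are all made up, and it omits permutation testing and normalization.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Toy running-sum enrichment score.

    ranked:   list of (gene, score) pairs sorted by score, highest first
    gene_set: set of gene names
    p:        weight exponent applied to |score| for genes in the set
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(hits)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")

    # Total weight of the set members, used to normalize each hit step.
    hit_total = sum(abs(s) ** p for (_, s), h in zip(ranked, hits) if h)

    running, best = 0.0, 0.0
    for (_, score), is_hit in zip(ranked, hits):
        if is_hit:
            running += abs(score) ** p / hit_total  # step up on a set member
        else:
            running -= 1.0 / n_miss                 # step down otherwise
        if abs(running) > abs(best):
            best = running                          # keep the largest deviation
    return best


# Hypothetical ranked list (log2FC-like scores), highest first.
ranked = [("A", 3.0), ("B", 2.5), ("C", 2.0), ("D", 1.0), ("E", 0.5),
          ("F", -0.5), ("G", -1.0), ("H", -2.0), ("I", -2.5), ("J", -3.0)]

# The set has three up-regulated members (A, B, C) and one down-regulated (G).
gene_set = {"A", "B", "C", "G"}

es = enrichment_score(ranked, gene_set)
print(es > 0)  # the score is positive even though G has a negative value
```

In this toy case the set members are concentrated at the top of the list, so the running sum peaks early and the score is positive, even though one member sits in the negative half of the ranking.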

The only filtering you would want to do is to remove genes that have ~zero counts across all samples. GSEA never uses a list filtered on the basis of pValue or Log2FC.

You would not want to use the raw reads that you used as input for DESeq2. Instead, you would need the normalized counts table that DESeq2 generates internally.

Generally we'd recommend the normal GCT format with gene set permutation if you have between 3 and 6 samples in each of your two groups, and the GCT format with phenotype permutation if you have 7+ samples per group. GSEA Preranked is generally used as a fallback when there is a specific reason that the standard modes aren't working (e.g. you know you want to use an external Log2FC calculation for ranking instead of our default signal-to-noise calculation, or you only have N=2 and our internal metrics fail with an error, or you are unable to output the appropriate normalization for your data). One isn't strictly better or worse than the other.
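The sample-size guidance above can be written down as a small helper. This is only a sketch of the rule of thumb stated in this message: the function name `suggest_gsea_mode` is hypothetical, using the smaller of the two group sizes is an assumption, and the other reasons for choosing Preranked (an external ranking metric, normalization problems) are not modeled.

```python
def suggest_gsea_mode(n_group_a, n_group_b):
    """Return (input format, permutation type) from the samples-per-group rule of thumb."""
    n = min(n_group_a, n_group_b)  # assumption: the smaller group decides
    if n >= 7:
        return ("GCT", "phenotype")
    if n >= 3:
        return ("GCT", "gene_set")
    # Fewer than 3 samples per group: fall back to GSEA Preranked,
    # which ranks from an external metric and permutes gene sets.
    return ("RNK", "gene_set")


print(suggest_gsea_mode(4, 5))  # ('GCT', 'gene_set')
print(suggest_gsea_mode(8, 7))  # ('GCT', 'phenotype')
print(suggest_gsea_mode(2, 2))  # ('RNK', 'gene_set')
```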

DAVID and GSEA have completely different underlying methodologies and require different inputs. It is understandable that they would take different inputs and give somewhat different results; it is quite common to use different methodologies like this and to report the results side by side. We obviously would have a bias towards the results from GSEA, but hits that are common across multiple methodologies are generally stronger candidates than those that are called by just one.

Let me know if you have more questions or if anything here was unclear.

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/c53ab3c5-9d11-45bc-9adf-82f7d571117an%40googlegroups.com.

Giorgia Silvestrini

Jan 4, 2022, 5:05:41 PM

to gsea...@googlegroups.com

I don't understand why you are telling me that I can't provide as input a .rnk list, thus preranked, with log2fc values when yesterday it seemed to me that you said the opposite: "As for which output of the DESeq2 ranked list to use though, I would probably recommend the Log2 Fold Change column, or the Test Statistic (Stat) column, you should get reasonable (albeit slightly different) results from either of those."

Anthony Castanza

Jan 4, 2022, 5:12:54 PM

to gsea...@googlegroups.com

Hi Giorgia,

I never said you couldn't! You absolutely can provide an unfiltered log2FC or Stat list in .rnk format as input for GSEA Preranked. That is a perfectly valid way to run GSEA.
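As a rough illustration of building such an unfiltered .rnk file, the Python sketch below converts a DESeq2 results table exported as CSV into two tab-separated columns (gene, score). Several things here are assumptions: the function name `deseq2_to_rnk`, the `gene` identifier column, and the demo values are made up, while `log2FoldChange` and `stat` are the usual DESeq2 result column names. No padj or fold-change cutoff is applied; only rows whose score is NA are dropped, which is what DESeq2 reports for genes with zero counts in every sample.

```python
import csv
import io
import math


def deseq2_to_rnk(results_csv, rnk_out, id_col="gene", score_col="log2FoldChange"):
    """Write every gene with a usable score to rnk_out; return the number written."""
    rows = []
    for row in csv.DictReader(results_csv):
        raw = row[score_col].strip()
        if raw in ("", "NA"):        # no score available for this gene
            continue
        score = float(raw)
        if math.isnan(score):
            continue
        rows.append((row[id_col], score))

    # Sorting is for readability only; no gene is removed by a cutoff.
    rows.sort(key=lambda r: r[1], reverse=True)
    for gene, score in rows:
        rnk_out.write(f"{gene}\t{score}\n")
    return len(rows)


# Hypothetical DESeq2 export; a real run would open files on disk instead.
demo = io.StringIO(
    "gene,baseMean,log2FoldChange,stat,padj\n"
    "GENE_A,100,1.5,4.2,0.001\n"
    "GENE_B,50,-0.2,-0.5,0.8\n"
    "GENE_C,0,NA,NA,NA\n"
    "GENE_D,80,-2.0,-5.1,0.0001\n"
)
out = io.StringIO()
n_written = deseq2_to_rnk(demo, out)
print(n_written)
print(out.getvalue())
```

Passing `score_col="stat"` would rank by the test statistic instead. Note that GENE_B is kept even though its padj is far from significant.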

I was trying to answer your question about what is preferred in general: the GCT format or the RNK format. GCT format using normalized counts is generally the preferred workflow under the circumstances I outlined, but preranked is perfectly fine assuming the data meets GSEA's "complete dataset" expectations.

My initial answer, which you've quoted, was specifically about which column of the DESeq2 results table to use for preranked.

Let me know if you have more questions

-Anthony

Giorgia Silvestrini

Jan 4, 2022, 5:29:56 PM

to gsea...@googlegroups.com

Hi Anthony, sorry I probably confused your answers to my questions. If we wanted to summarize what was said in points:

1) if I want to do the PreRanked analysis with the .rnk file, I have to use an unfiltered list of genes with the respective log2FC value next to it.

2) if I want to do the analysis with the .gct file I have to use the normalized counts and I have to provide as input to GSEA a list of genes with the counts previously filtered by cutoff or it must be unfiltered? So, for example, if I have a list of 40000 genes with their counts, I will provide to GSEA all 40000 genes and not the DEGs that I would have obtained if I had filtered these 40000 genes for Pvalue and foldchange.

Thanks,

Giorgia

Anthony Castanza

Jan 4, 2022, 5:41:09 PM

to gsea...@googlegroups.com

Hi Giorgia,

I'll address these in-line:

1) if I want to do the PreRanked analysis with the .rnk file, I have to use an unfiltered list of genes with the respective log2FC value next to it.

- This is correct.

2) if I want to do the analysis with the .gct file I have to use the normalized counts and I have to provide as input to GSEA a list of genes with the counts previously filtered by cutoff or it must be unfiltered? So, for example, if I have a list of 40000 genes with their counts, I will provide to GSEA all 40000 genes and not the DEGs that I would have obtained if I had filtered these 40000 genes for Pvalue and foldchange.

- The gct file should contain normalized counts for all genes that were expressed in the samples. This should NOT be filtered by log2fc or pValue. The only filtering that should be done would be to remove genes with zero counts (summed across all samples).
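The filtering rule above (drop only genes whose counts sum to zero across all samples) can be sketched in Python together with the GCT layout: a `#1.2` version line, a line with the row and column counts, a header starting with `NAME` and `Description`, then one row per gene. The function name `write_gct`, the sample names, and the values are made up for illustration, and the matrix is assumed to already hold normalized counts (for example, exported from DESeq2 in R with `counts(dds, normalized=TRUE)`).

```python
import io


def write_gct(genes, samples, matrix, out):
    """Write normalized counts as GCT, dropping genes with zero counts in all samples."""
    # The only filter: remove rows whose counts sum to zero across samples.
    kept = [(g, row) for g, row in zip(genes, matrix) if sum(row) > 0]

    out.write("#1.2\n")                                   # GCT version line
    out.write(f"{len(kept)}\t{len(samples)}\n")           # number of genes, samples
    out.write("NAME\tDescription\t" + "\t".join(samples) + "\n")
    for gene, row in kept:
        values = "\t".join(str(v) for v in row)
        out.write(f"{gene}\tna\t{values}\n")              # "na" as a placeholder description
    return len(kept)


# Hypothetical normalized counts: 3 genes, 3 samples per group.
genes = ["GENE_A", "GENE_B", "GENE_C"]
samples = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3"]
matrix = [
    [10.5, 12.0, 11.2, 30.1, 28.4, 33.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],      # zero everywhere, so it is dropped
    [5.0, 0.0, 4.1, 4.8, 5.2, 0.0],      # zero in some samples only, so it is kept
]

gct = io.StringIO()
n_kept = write_gct(genes, samples, matrix, gct)
print(n_kept)
print(gct.getvalue())
```

GENE_C shows little change between groups and would not pass a DEG cutoff, but it stays in the file because it is expressed in at least one sample.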
