Phenotype permutation vs gene-set permutation


Joe Horder

Jun 13, 2021, 1:56:34 PM
to gsea-help
Hi there!

Looking for clarification of the use of the permutation type.

According to the user guide, phenotype permutation is suggested when all phenotypes in the data have at least seven samples. It also states that gene-set permutation is useful when there are fewer than seven samples in a given phenotype. Following on from this, the manual recommends using phenotype permutation wherever possible.

Therefore, in an experiment with two phenotypes of three samples each, is the gene-set permutation type the 'best' to use? In that case, should the FDR significance cutoff be 0.05? Or would it still be worth running the phenotype permutation as well?

Thanks in advance,
Joe

Anthony Castanza

unread,
Jun 16, 2021, 2:20:14 AM6/16/21
to gsea...@googlegroups.com
Hi Joe,

GSEA uses a null distribution generated by permuting and retesting the data itself. With only three samples per group, there aren't enough possible combinations to generate a reasonable null distribution with the default of 1000 permutations. The null distribution would be, for lack of a better word, contaminated by many repeats of identical tests, and a large proportion of the null would likely be identical to the "true" assignments. With this few samples, the FDR calculation in phenotype mode can't generate a meaningful result and simply returns no significance.

So, in this case we recommend gene_set permutation (and the alternative 0.05 FDR cutoff) as it's much easier to build a null distribution when sampling from many thousands of genes.
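
To put rough numbers on the "not enough possible combinations" point, here is a minimal sketch that counts the distinct ways of splitting the samples into two groups of fixed size. The function name is made up for illustration, and this is only the combinatorics, not how GSEA itself draws its permutations.

```python
from math import comb

def distinct_phenotype_splits(n_a: int, n_b: int) -> int:
    """Number of distinct ways to assign n_a + n_b samples to two groups of sizes n_a and n_b."""
    return comb(n_a + n_b, n_a)

# Compare three samples per group with seven samples per group.
for n in (3, 7):
    print(f"{n} vs {n} samples: {distinct_phenotype_splits(n, n)} distinct label assignments")
```

With three samples per group there are only 20 distinct assignments, so 1000 permutations can only revisit the same handful of labelings over and over. With seven per group there are 3432.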

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego


Joe Horder

Aug 24, 2021, 10:13:09 AM
to gsea-help
Hi Anthony,

Thank you very much for this response!

And just to follow up - I believe that the preranked analysis inherently uses gene_set permutation, and therefore the 0.05 FDR cutoff should always be used when running GSEA preranked?

Joe

Anthony Castanza

Aug 24, 2021, 12:09:29 PM
to gsea...@googlegroups.com

Hi Joe,

This is correct.

-Anthony

Joshua Lyu

Jul 24, 2022, 9:10:10 PM
to gsea-help
Hi there, can we still use gene_set even when the sample size is greater than 7? And set the FDR cutoff as 0.05? Thank you!

Anthony Castanza

Jul 25, 2022, 2:14:07 PM
to gsea...@googlegroups.com

Hello,

Yes, you can do this. However, we generally find that while gene_set permutation gives more “significant” results, the results from phenotype permutation are generally more robust when the data fits its expectations.

-Anthony

Feriel Benchenouf

Aug 5, 2022, 5:35:14 AM
to gsea-help

Hello,

I have two phenotypes/groups with nine samples each. I performed a GSEA with phenotype permutation (1000 permutations), but I had no significant gene sets with an FDR < 25%. However, I did when I tried gene-set permutation.
I don't understand why I don't get any significant gene sets with phenotype permutation.

Furthermore, my dataset contains 24658 genes, but only 23300 genes were collapsed for the analysis; I don't understand this either. Can I change something in the parameters, so all my genes are used during analysis?

Thank you in advance for your explanations

Feriel

Anthony Castanza

Aug 5, 2022, 2:58:02 PM
to gsea...@googlegroups.com

Hi Feriel,

Phenotype permutation is a very stringent procedure that can frequently lead to low numbers of significant results; however, these are generally the most robust results. It isn’t uncommon to get no significant results if the phenotype being assessed didn’t result in particularly large molecular changes in the samples. Did you perform any kind of hierarchical clustering or PCA to determine if you have good separation between your phenotypes? Perhaps you have outliers that are causing issues with the phenotype permutation and should be excluded. Additionally, how was this data normalized? GSEA should generally be run on normalized counts, not raw counts or FPKM/RPKM/TPM data. We also recommend eliminating genes that aren’t expressed above a reasonable threshold in the data. This prevents unreasonable inflation of the global distribution with irrelevant genes.
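
As a rough illustration of the "eliminate genes that aren't expressed above a reasonable threshold" step, the sketch below keeps a gene only if it reaches a minimum normalized value in a minimum number of samples. The function name, the threshold of 10, the minimum of three samples, and the gene names are all assumptions chosen for the example, not GSEA recommendations.

```python
def filter_low_expression(expr, min_value=10.0, min_samples=3):
    """Keep genes whose normalized value reaches min_value in at least min_samples samples.

    expr maps a gene name to its list of normalized values, one per sample.
    Both cutoffs are illustrative assumptions and should be tuned to the data.
    """
    return {
        gene: values
        for gene, values in expr.items()
        if sum(v >= min_value for v in values) >= min_samples
    }

# Hypothetical normalized values for three genes across six samples.
expr = {
    "GENE_A": [120.0, 98.5, 143.2, 110.0, 87.3, 101.9],
    "GENE_B": [0.0, 1.2, 0.0, 0.0, 2.5, 0.0],
    "GENE_C": [15.0, 3.0, 22.1, 0.0, 12.4, 9.9],
}
kept = filter_low_expression(expr)
print(sorted(kept))
```

Here GENE_B never reaches the cutoff and is dropped before the dataset would be handed to GSEA.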

 

Gene set permutation is much more permissive than phenotype permutation: it scrambles the genes in the sets rather than the samples in the phenotypes to construct the null distribution. This mode only really assesses how likely it is that a set of a given size would be enriched to that degree in the data, rather than how likely it is that a set was enriched in that phenotype compared to a random phenotype.
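
The difference between the two null distributions can be sketched on toy data. This is not GSEA's implementation: the per-gene difference of group means and the mean score over the set are simple stand-ins for GSEA's ranking metric and enrichment score, and all function names and data are invented for illustration.

```python
import random
from statistics import mean

def gene_scores(expr, labels):
    """Per-gene difference of group means (a simple stand-in for a ranking metric)."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == "A"]
        b = [v for v, lab in zip(values, labels) if lab == "B"]
        scores[gene] = mean(a) - mean(b)
    return scores

def set_score(scores, gene_set):
    """Mean gene score over the set (a stand-in for an enrichment score)."""
    return mean(scores[g] for g in gene_set)

def phenotype_null(expr, labels, gene_set, n_perm, rng):
    """Shuffle the sample labels, recompute every gene score, and rescore the same set."""
    null = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        null.append(set_score(gene_scores(expr, shuffled), gene_set))
    return null

def gene_set_null(expr, labels, gene_set, n_perm, rng):
    """Keep the real labels and scores, and score random gene sets of the same size."""
    scores = gene_scores(expr, labels)
    all_genes = list(scores)
    return [set_score(scores, rng.sample(all_genes, len(gene_set))) for _ in range(n_perm)]

rng = random.Random(0)
labels = ["A"] * 3 + ["B"] * 3  # three samples per phenotype
expr = {f"gene{i}": [rng.gauss(0.0, 1.0) for _ in labels] for i in range(200)}
gene_set = [f"gene{i}" for i in range(15)]

pheno = phenotype_null(expr, labels, gene_set, 1000, rng)
gset = gene_set_null(expr, labels, gene_set, 1000, rng)
print(len(set(pheno)), "distinct values in the phenotype null")
print(len(set(gset)), "distinct values in the gene-set null")
```

With three samples per group the phenotype null can contain at most 20 distinct values, however many permutations are requested, while the gene-set null draws from a far larger pool of random sets.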

 

A reduction from 24658 to 23300 genes through collapsing the dataset is both reasonable and expected. Our Collapsing tool eliminates genes that don’t have an annotated symbol match and combines any gene symbols that might be annotated by multiple IDs (e.g. an Ensembl gene on a patch contig and an Ensembl gene on the primary assembly will be combined). You can partially disable this behavior, but we do not recommend it. In the advanced fields section, you can set the “Omit features with no symbol match” parameter to “false” instead of its default “true”. This will allow GSEA to keep any genes that did not match anything in the chip file, but it will still cause any multiple ID-to-symbol mappings to be combined (which is necessary).
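
To show why the gene count drops, here is a minimal sketch of the collapsing idea: features without a symbol match are dropped (or optionally kept), and features that map to the same symbol are merged into one row. The function name, the `omit_unmatched` flag, the identifiers, and the choice of taking the per-sample maximum when merging are all assumptions for illustration; this is not the GSEA Collapsing tool's code.

```python
from collections import defaultdict

def collapse_to_symbols(expr, id_to_symbol, omit_unmatched=True):
    """Collapse feature-level rows to one row per gene symbol.

    expr maps a feature ID to its list of values, one per sample.
    id_to_symbol plays the role of the chip file (feature ID -> gene symbol).
    Rows sharing a symbol are merged by per-sample maximum (an assumed rule).
    """
    grouped = defaultdict(list)
    for feature_id, values in expr.items():
        symbol = id_to_symbol.get(feature_id)
        if symbol is None:
            if omit_unmatched:
                continue  # no symbol match: drop the feature
            symbol = feature_id  # keep the feature under its original ID
        grouped[symbol].append(values)
    return {sym: [max(col) for col in zip(*rows)] for sym, rows in grouped.items()}

# Hypothetical IDs: two map to the same symbol, one has no match at all.
expr = {
    "ID_1": [5.0, 7.0],
    "ID_1_PATCH": [6.0, 2.0],
    "ID_2": [1.0, 1.5],
    "ID_3": [9.0, 9.0],
}
id_to_symbol = {"ID_1": "SYMBOL_A", "ID_1_PATCH": "SYMBOL_A", "ID_2": "SYMBOL_B"}

collapsed = collapse_to_symbols(expr, id_to_symbol)
kept_all = collapse_to_symbols(expr, id_to_symbol, omit_unmatched=False)
print(len(expr), "features ->", len(collapsed), "symbols (unmatched omitted)")
print(len(expr), "features ->", len(kept_all), "rows (unmatched kept)")
```

In both cases the two IDs for SYMBOL_A end up as a single row, which mirrors the point that multiple ID-to-symbol mappings are always combined.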

 

Finally, in the future, we’d ask that you create a new topic for issues specific to your dataset, to prevent notifying the original posters of a given topic with answers that aren’t relevant to their original issue. Thanks!

 

-Anthony

 
