how to understand geneset permutation

Dong

Nov 11, 2021, 12:06:39 AM

to gsea-help

Hi Anthony,

I more or less understand how phenotype permutation works. I also have a rough idea of how a gene set is permuted: a random gene set is constructed by randomly choosing the same number of genes from the gene list. What I don't understand is how ES, NES, and FDR are then calculated. Is the score calculated by comparing the permuted gene set with the pre-selected gene set?

I am not sure whether I have described the question clearly, but it would be great if you can understand what I mean.

Thanks.

Dong

Anthony Castanza

Nov 11, 2021, 1:29:51 PM

to gsea...@googlegroups.com

Hi Dong,

The calculations for the GSEA metrics are described in the "GSEA Statistics" section of the GSEA User Guide: https://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

And in the GSEA PNAS publication: https://www.gsea-msigdb.org/gsea/doc/subramanian_tamayo_gsea_pnas.pdf

But essentially, yes. GSEA computes enrichment scores using a weighted running-sum statistic for each gene set in the real gene list and in each of the permuted gene lists, then compares the real score against the distribution of permuted scores.
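For readers who want to see the running sum concretely, here is a minimal Python sketch of a weighted running-sum enrichment score in the spirit of the PNAS paper. The function name `enrichment_score`, the `weight` parameter, and the six-gene toy list are assumptions made for illustration; this is not GSEA's own code.

```python
def enrichment_score(ranked_genes, metric, gene_set, weight=1.0):
    """Weighted running-sum ES for one gene set (illustrative sketch).

    ranked_genes: gene names sorted by the ranking metric, highest first.
    metric: ranking-metric values in the same order as ranked_genes.
    gene_set: the genes belonging to the set being scored.
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    # Total weight of the hits: sum of |metric| ** weight over set members.
    hit_total = sum(abs(m) ** weight for m, h in zip(metric, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    best = 0.0
    for m, h in zip(metric, hits):
        if h:
            # Step up at a set member, in proportion to its metric.
            running += abs(m) ** weight / hit_total
        else:
            # Step down by a constant amount at every other gene.
            running -= 1.0 / n_miss
        # ES is the largest deviation of the running sum from zero.
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (made-up values, not real data).
genes = ["A", "B", "C", "D", "E", "F"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

print(enrichment_score(genes, metric, ["A", "B"]))  # set at the top of the list
print(enrichment_score(genes, metric, ["E", "F"]))  # set at the bottom of the list
```

A set concentrated at the top of the ranked list gets a positive ES, and a set concentrated at the bottom gets a negative one.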

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Dong

Nov 11, 2021, 10:36:20 PM

to gsea-help

Hi Anthony,

Thanks for sharing the information and publication. I tried to understand "each of the permuted gene lists". So, does gene set permutation mean creating a gene set by randomly choosing the same number of genes from the real gene list? Since the real gene list always contains more genes than the gene set, permuting the real gene list could produce many different gene sets. Would GSEA then compute the ES for each of these different gene sets?

It seems a little complicated...

Thanks.

Dong

Nov 13, 2021, 10:08:30 AM

to gsea-help

Hi Anthony,

I haven't heard from you yet. I think my question may have been confusing, so I will try to present it in a different way.

1. Gene set permutation. Basically, the same number of genes as in the gene set will be randomly chosen from the gene list to construct a random gene set? Do these genes have to be the same as those in the gene set? (I guess not.)

2. 1000 permutations. Will this produce 1000 random gene sets from the gene list?

3. What is the next step for those gene sets?

4. Could you please give an example using one specific gene set and one gene list?

Thanks.

Dong

Nov 13, 2021, 10:36:47 AM

to gsea-help

Hi Anthony,

I read a little more. Below is the text I copied from the paper. Basically, gene set permutation is also called gene sampling. Based on the description, the randomly assembled gene sets lose their connection to a given gene set Gi. The only feature that is maintained is the number of genes in the gene set. The genes in the randomly assembled gene sets could be completely different from those in Gi. Doesn't it then seem unfair to compare the ES of those assembled gene sets with that of Gi?

Is the score S in the text the enrichment score? And how should I understand the highlighted text?

Thanks.

3.2.1. Gene Sampling

In gene sampling the significance of a gene set score S(Gi) for a given gene set Gi is assessed by comparing it to the scores of randomly assembled sets of ||Gi|| genes from the reference set U, i.e., all genes under study. In gene sampling method, a large number of random gene sets are assembled, and their scores are calculated. Then the significance value of the gene set score of Gi is calculated as the fraction of assembled gene sets that lead to stronger scores than the score of Gi, where a score in comparison to another is considered stronger if it is more in favor of rejecting the null hypothesis of interest.

Since gene sampling does not depend on the number of samples, it has been widely used for gene set analysis of datasets with small sample sizes (Subramanian et al., 2005; Tian et al., 2005; Ackermann and Strimmer, 2009). The main shortcoming of gene sampling is that it relies on the unrealistic assumption of independence between genes within a gene set. Usually genes within a gene set show a highly correlated behavior; therefore, a gene sampling method may incorrectly predict a gene set as differentially enriched only because of high correlation between its genes. In this regard, it may cause false positive predictions. Another shortcoming of gene sampling is being computationally demanding. For each gene set Gi, the whole process of gene set score calculation should be repeated for a large number of randomly assembled gene sets. In implementations of the gene-sampling approach, usually the number of assembled gene sets is an order of magnitude of 1,000. This number of repetitions makes the significance evaluation computationally demanding. Moreover, gene sampling may lead to a lack of statistical reliability of the significance values for large gene sets (Keller et al., 2007). Even using an order of magnitude of 1,000 assembled gene sets may not be enough to represent the background distribution; therefore, the significance value for large gene sets may not be statistically reliable.

From <https://www.frontiersin.org/articles/10.3389/fgene.2020.00654/full>
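The gene sampling procedure quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set score S is taken to be the mean gene-level statistic (the paper leaves S generic), the name `gene_sampling_pvalue` and its parameters are hypothetical, and the 100 gene statistics are made-up numbers.

```python
import random


def gene_sampling_pvalue(stats, gene_set, n_sets=1000, seed=0):
    """Significance of a set score by gene sampling (illustrative sketch).

    stats: dict mapping every gene in the reference set U to a statistic.
    gene_set: the genes of the set Gi being tested.
    Returns (observed score, fraction of random sets scoring at least as high).
    """
    rng = random.Random(seed)
    universe = list(stats)

    def score(genes):
        # Assumed set score S: mean of the gene-level statistics.
        return sum(stats[g] for g in genes) / len(genes)

    observed = score(gene_set)
    stronger = 0
    for _ in range(n_sets):
        # Random set of ||Gi|| genes drawn without replacement from U.
        random_set = rng.sample(universe, len(gene_set))
        if score(random_set) >= observed:
            stronger += 1
    return observed, stronger / n_sets


# Made-up reference set: 100 genes with statistics 0.0 .. 99.0.
stats = {f"g{i}": float(i) for i in range(100)}
top_set = [f"g{i}" for i in range(95, 100)]

observed, p_value = gene_sampling_pvalue(stats, top_set)
print(observed, p_value)
```

The random sets keep only the size of Gi, which is exactly the point: they show what scores a set of that size reaches by chance alone.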

Anthony Castanza

Nov 13, 2021, 2:44:48 PM

to gsea-help

The gene_set permutation mode, which we acknowledge is inferior to the phenotype permutation mode, tests each gene set on the basis of how likely it is that a random gene set of the same size would be enriched within the given dataset.

GSEA samples random gene sets of the same size as the set of interest and calculates an enrichment score for each, which produces a distribution of random enrichment scores. The true enrichment score of the identically sized real set is then compared to this distribution to determine whether the observed enrichment is more extreme than would be expected if the true set, like the random sets, had no functional connection to a given process.

In this permutation mode, GSEA constructs a "null" distribution from sets that are random and are therefore assumed to have no coordinated biological function. The null hypothesis is that the given real set has no coordinated biological function within the data. An enrichment more extreme than that observed in the null distribution (sets that we "know" are random and have no coordinated biological function) would allow us to reject the null hypothesis and say that the set does have a coordinated function at [pValue] level of probability.
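As a rough illustration of that comparison, the sketch below takes an observed ES and a list of null ES values from random sets and derives a nominal p-value and a normalized ES. It assumes the convention described in the PNAS paper, where only the null scores with the same sign as the observed ES are used and the NES is the ES divided by the mean of those scores. The function name `nominal_p_and_nes` and all the numbers are made up for illustration.

```python
from statistics import mean


def nominal_p_and_nes(observed_es, null_es):
    """Compare an observed ES with a null distribution (illustrative sketch).

    observed_es: ES of the real gene set.
    null_es: ES values of random gene sets of the same size.
    Returns (nominal p-value, normalized enrichment score).
    """
    if observed_es >= 0:
        # Positive ES: use only the non-negative part of the null.
        same_sign = [es for es in null_es if es >= 0]
        extreme = sum(es >= observed_es for es in same_sign)
    else:
        # Negative ES: use only the negative part of the null.
        same_sign = [es for es in null_es if es < 0]
        extreme = sum(es <= observed_es for es in same_sign)
    if not same_sign:
        raise ValueError("no null scores share the sign of the observed ES")

    scale = abs(mean(same_sign))
    if scale == 0:
        raise ValueError("null scores of this sign are all zero")

    # Fraction of same-sign null scores at least as extreme as the real one.
    p_value = extreme / len(same_sign)
    # Dividing by the mean null ES adjusts for gene set size.
    nes = observed_es / scale
    return p_value, nes


# Made-up null distribution; a real run would have about 1000 values.
null_es = [0.2, 0.3, 0.4, 0.5, -0.2, -0.3, -0.4, -0.5]

print(nominal_p_and_nes(0.6, null_es))
print(nominal_p_and_nes(-0.4, null_es))
```

With only a handful of null values the p-value is very coarse, which is one reason the number of permutations is usually on the order of 1,000.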

Does that make sense?

-Anthony


Dong

Nov 14, 2021, 10:26:38 PM

to gsea-help

Hi Anthony,

Thank you so much for the detailed explanation. I think I now understand how gene set permutation works!