Native features vs Gene symbols


Maria Villa

Dec 9, 2025, 8:48:39 AM
to gsea-help
Hello, I would like to ask what the following message means in a GSEA analysis of RNA-seq data:
"The dataset has 55634 (for example) native features After collapsing features into gene symbols, there are: 35452 genes "
What exactly are "native features" and why do they "collapse" into a smaller number of genes?
Thank you very much in advance,
Sincerely,
María.

David Eby

Dec 9, 2025, 12:11:47 PM
to gsea...@googlegroups.com
Hi Maria,

This terminology traces back to the early history of the project, when GSEA was commonly used to analyze microarray data. In that case, "native features" could alternatively be called "platform features" or, in plainer language, microarray probe identifiers. In such an analysis, multiple probes are often allocated to a single gene, so the measure of gene expression needs to be determined from the entire group of probes.

Doing such a calculation across features is what we call "collapsing", and there are different ways to do it depending on the type of data you are using: averaging, max value, absolute max value, etc.
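
As a rough illustration of the collapse step described above, here is a minimal Python sketch. It is not GSEA's implementation: the probe IDs, gene symbols, values, and the `collapse` helper are all hypothetical, and it handles a single sample rather than a full expression matrix.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical probe-level data for one sample:
# probe ID -> (gene symbol, expression value).
probes = {
    "probe_1": ("GENE_A", 2.0),
    "probe_2": ("GENE_A", -5.0),
    "probe_3": ("GENE_A", 3.0),
    "probe_4": ("GENE_B", 1.5),
}


def collapse(probes, mode="max"):
    """Reduce all probe values that share a gene symbol to one value per gene."""
    # Group the values by gene symbol.
    grouped = defaultdict(list)
    for symbol, value in probes.values():
        grouped[symbol].append(value)

    # The three strategies mentioned above; "abs_max" keeps the sign
    # of the value with the largest magnitude.
    reducers = {
        "mean": mean,
        "max": max,
        "abs_max": lambda values: max(values, key=abs),
    }
    reduce_values = reducers[mode]
    return {symbol: reduce_values(values) for symbol, values in grouped.items()}


for mode in ("mean", "max", "abs_max"):
    print(mode, collapse(probes, mode))
```

Four native features become two genes here, and the value reported for GENE_A depends on which mode is chosen.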

This still applies to some more modern GSEA analyses, however, so it's not totally accurate to write it off as an aspect of historical data.  My colleague Anthony can give you a more detailed answer and recommendation if you have more specific questions about a particular type of data or analysis.

Regards,
David


--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/gsea-help/35c2e290-d95b-468b-9580-cbd1638ffe8en%40googlegroups.com.

Maria Villa

Dec 10, 2025, 5:55:32 AM
to gsea...@googlegroups.com
Dear David, thank you very much for your answer. I understand that collapsing is a necessary step when using microarray data (where probes are used to detect genes).
However, if the source of the data is RNA-seq, the dataset is supposed to consist of unique gene identifiers, right? Then, if the "Collapse" parameter is enabled in the GSEA analysis, the number of features should not be reduced, right?
But that is not the case with the analysis I am doing: my dataset (from RNA-seq) has 55634 supposedly unique gene IDs, but when I run GSEA with "collapse" enabled (because this is the default option), the result indicates that "After collapsing features into gene symbols, there are: 35452 genes". Does this mean that the 55634 features were not unique gene IDs?
Thank you very much in advance.
Regards,
María.


Anthony Castanza

Dec 10, 2025, 3:21:20 PM
to gsea...@googlegroups.com
Hi Maria,

For RNA-seq data, there are many cases where several gene IDs ("ENSG000xxx") map to a single canonical annotated gene symbol. This can happen for many reasons, including chromosomal variations that are represented by variant constructs mapped to alternate assemblies, or a series of transcripts produced from the same locus that are, for whatever reason, not properly annotated as transcript variants and are assigned their own unique IDs, even though the gene naming consortium has assigned them the same symbol. There are also many duplicated genes, like the U1/2/3 spliceosomal RNAs, lncRNAs, etc., that have many identical or nearly identical copies represented by different ENSG IDs but are condensed to the same gene symbol, as they are functionally identical and largely indistinguishable by standard shotgun sequencing approaches. There are also cases with historical RNA-seq data where constructs that had previously been annotated as different genes have since been condensed to a single gene (such as two different protein products that are produced by a single parent mRNA).

The degree of condensation from the mapping step will vary depending on the transcriptome assembly used in your initial quantitation, as well as the method used for that quantitation, but 55k RNA constructs being reduced to 35k annotated genes is in the ballpark of what we would expect to see. Some of that is also likely to be a reduction from filtering out novel constructs that lack annotation by other resources (e.g. not having an ID assigned by NCBI in addition to their Ensembl ID). GSEA's default behavior excludes these low-evidence constructs. They can be included by adjusting one of the GSEA advanced fields, but they will never appear in any MSigDB gene set.
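
To make the two sources of reduction concrete (several IDs sharing one symbol, and IDs with no usable annotation being dropped), here is a small Python sketch. The IDs, symbols, annotation table, and the `summarize_collapse` helper are placeholders invented for illustration; they are not real Ensembl entries and this is not how GSEA stores its mappings.

```python
# Hypothetical annotation table: feature ID -> gene symbol.
# None marks a feature with no annotated symbol.
id_to_symbol = {
    "ENSG_example_1": "GENE_A",
    "ENSG_example_2": "GENE_A",  # second ID sharing the same symbol
    "ENSG_example_3": "GENE_B",
    "ENSG_example_4": None,      # novel construct without annotation
}


def summarize_collapse(feature_ids, id_to_symbol):
    """Count native features, distinct symbols after mapping, and dropped IDs."""
    symbols = set()
    unmapped = []
    for feature_id in feature_ids:
        symbol = id_to_symbol.get(feature_id)
        if symbol is None:
            unmapped.append(feature_id)
        else:
            symbols.add(symbol)
    return {
        "native_features": len(feature_ids),
        "collapsed_genes": len(symbols),
        "unmapped": unmapped,
    }


summary = summarize_collapse(list(id_to_symbol), id_to_symbol)
print(summary)
```

In this toy case, four unique feature IDs yield two genes: one is lost to a shared symbol and one to missing annotation. The input IDs were all unique, which is also true of a real RNA-seq dataset that shrinks on collapse.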

We always recommend running GSEA with a collapse mode enabled, even if you are already operating in the gene symbols namespace, as GSEA's tools enable remapping of old gene symbols to their current versions.
For the sake of understanding what GSEA is mapping, we do include full collapse mapping reports as part of GSEA's output.

Let me know if you have any additional questions!

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Maria Villa

Dec 11, 2025, 1:48:19 AM
to gsea...@googlegroups.com
Hi Anthony,
Thank you very much for the detailed explanation. Just one additional question. When you perform the analysis with the type of RNA-seq data I mentioned (and assuming now that the best thing to do is to enable "collapse"), if the dataset contains gene symbols rather than Ensembl IDs, which is the best Chip selection? Is it "Human gene symbol with remapping", "Human HGNC ID" or "Human NCBI gene ID"?
I usually select "Human gene symbol with remapping".
Thanks in advance.
Regards,
María.

Anthony Castanza

Dec 11, 2025, 1:00:18 PM
to gsea...@googlegroups.com
"Human gene symbol with remapping" is the best choice for a human dataset that is already in the gene symbols namespace, yes. Using this file makes sure that, as best as we possibly can, all the symbols in your dataset are converted to the same versions used in our gene set files, preventing any renamed genes from being erroneously excluded.
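
A small Python sketch of why remapping matters, under stated assumptions: the alias table, gene set, and `remap_symbols` helper below are made up for illustration and do not reflect the actual contents or format of the chip file. The two renames shown (SEPT2 to SEPTIN2, MLL to KMT2A) are examples of symbols that have been updated over time.

```python
# Illustrative alias table: retired symbol -> current symbol.
retired_to_current = {
    "SEPT2": "SEPTIN2",
    "MLL": "KMT2A",
}


def remap_symbols(symbols, retired_to_current):
    """Replace retired symbols with current ones; leave all others unchanged."""
    return [retired_to_current.get(symbol, symbol) for symbol in symbols]


# Hypothetical gene set using current symbols, and a dataset using older ones.
gene_set = {"SEPTIN2", "KMT2A", "TP53"}
dataset_symbols = ["SEPT2", "MLL", "TP53"]

matched_before = gene_set & set(dataset_symbols)
matched_after = gene_set & set(remap_symbols(dataset_symbols, retired_to_current))

print(sorted(matched_before))
print(sorted(matched_after))
```

Without remapping, only one of the three genes would match the gene set; after remapping, all three do.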

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego