Hi Maria,
For RNA-seq data there are many cases where several Ensembl gene IDs ("ENSG000xxx") map to a single canonical annotated gene symbol. This can happen for many reasons, including chromosomal variations that are represented by variant constructs mapped to alternate assemblies, or a series of transcripts produced from the same locus that are, for whatever reason, not properly annotated as transcript variants and are assigned their own unique IDs, even though the gene naming consortium has assigned them the same symbol. There are also many duplicated genes, like the U1/2/3 spliceosomal RNAs, lncRNAs, etc., that have many identical or nearly identical copies represented by different ENSG IDs but condensed to the same gene symbol, as they are functionally identical and largely indistinguishable by standard shotgun sequencing approaches. With historical RNA-seq data there are also cases where constructs that had previously been annotated as different genes have since been condensed to a single gene (such as two different protein products that are produced by a single parent mRNA).
The degree of condensation from the mapping step will vary depending on the transcriptome assembly used in your initial quantitation, as well as the method used for that quantitation, but 55k RNA constructs being reduced to 35k annotated genes is in the ballpark of what we would expect to see. Some of that reduction is also likely to come from filtering out novel constructs that lack annotation by other resources (e.g. not having an ID assigned by NCBI in addition to their Ensembl ID). GSEA's default behavior excludes these low-evidence constructs. They can be included by adjusting one of the GSEA advanced fields, but they will never appear in any MSigDB gene set.
We always recommend running GSEA with a collapse mode enabled, even if you are already operating in the gene symbols namespace, as GSEA's tools enable remapping of old gene symbols to their current variants.
For the sake of understanding what GSEA is mapping, we do include full collapse mapping reports as part of GSEA's output.
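If it helps to see the many-to-one reduction concretely, here is a minimal Python sketch of the general idea. It is not GSEA's actual implementation: the IDs, symbols, mapping table, and the `collapse` function are all made up for illustration, and the "max" and "sum" modes are only meant to resemble the kind of per-sample reduction a collapse step performs.

```python
from collections import defaultdict

# Hypothetical ID-to-symbol table (placeholder names, not real annotations).
# A value of None stands for a novel construct with no symbol assigned.
ID_TO_SYMBOL = {
    "ENSG_A1": "GENE_A",
    "ENSG_A2": "GENE_A",   # second ID condensed to the same symbol
    "ENSG_B1": "GENE_B",
    "ENSG_NOVEL": None,
}

def collapse(expression, id_to_symbol, mode="max"):
    """Reduce rows keyed by gene ID to rows keyed by gene symbol.

    expression: dict of gene ID -> list of per-sample values.
    Returns (collapsed rows, list of IDs dropped for lacking a symbol).
    """
    reducer = {"max": max, "sum": sum}[mode]
    grouped = defaultdict(list)
    unmapped = []
    for gene_id, values in expression.items():
        symbol = id_to_symbol.get(gene_id)
        if symbol is None:
            unmapped.append(gene_id)        # filtered out, as with unannotated constructs
        else:
            grouped[symbol].append(values)
    # Apply the reducer sample by sample across all rows sharing a symbol.
    collapsed = {
        symbol: [reducer(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }
    return collapsed, unmapped

expression = {
    "ENSG_A1": [5, 1],
    "ENSG_A2": [2, 7],
    "ENSG_B1": [3, 3],
    "ENSG_NOVEL": [9, 9],
}
collapsed, unmapped = collapse(expression, ID_TO_SYMBOL, mode="max")
print(len(expression), "input rows ->", len(collapsed), "symbols;", "dropped:", unmapped)
```

In this toy example four input rows become two symbols, with one row dropped for lacking a symbol, which is the same kind of reduction you are seeing at a much larger scale.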
Let me know if you have any additional questions!
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego