Mislabeling in Heatmaps/input problem?


Maria Soverina

Jul 21, 2025, 11:57:11 PM
to gsea-help
Hello. I'm using the desktop version of GSEA (4.4.0 for Windows) to analyze RNASeq data.

As input for GSEA, I used:
- a normalized counts table (obtained using DESeq2) according to the specifications in the GSEA "documentation" section (txt format)
- a cls file indicating the phenotype levels (the two groups I'm comparing, each group with 3 replicates)
The analysis indicates "success," and upon checking the output, all the files appear "correct" except for the generated heatmap, which incorrectly places other groups/samples in place of the ones I'm comparing (mislabeling)

After seeing this, I decided to run a test and remove one of the groups I want to compare from the normalized counts table, but request its comparison using the cls file. I thought it would give an error because when searching for the sample indicated in the cls file, it wouldn't find it in the normalized counts table. However, the result was "success" and all the output files were correctly titled, indicating the compared groups except for the heatmap (again, it mentioned a different sample than the one I wanted to compare).

After seeing this, I decided to run another test, removing all the groups from my normalized counts table and leaving only the two groups I wanted to compare (those indicated in the CLS file). However, the heatmap was still mislabeled, indicating a group that didn't exist in my normalized counts table.

I made sure to delete my uploaded files in the "load data" section and also clicked "clear recent file history" before rerunning the test, but I got the same results. I opened and closed the desktop application multiple times to make sure the new files I uploaded in the "load data" section were updating, but it didn't make any difference.

Could you help me find the problem or what I'm doing wrong?

I also ran the same groups using the GSEA Preranked method; however, the output files don't include the names of the groups compared, making it difficult to manage/visualize the data, especially when there are multiple comparisons to make from a bulk RNASeq dataset with several groups.

Thank you in advance.

Anthony Castanza

Jul 22, 2025, 1:08:49 PM
to gsea-help
Hello,

The issues here likely stem from mismatches between the sample phenotype assignments in the CLS file and the sample order in the GCT file.
The CLS file format is extremely finicky and requires precise implementation of the format as described in the specification.
For reference, the formats are described here: https://docs.gsea-msigdb.org/#GSEA/Data_Formats/#phenotype-data-formats

The CLS file must account for every sample in the dataset file, with no omissions, and must list the phenotypes in the same order as the samples appear in that file.

For example, say you have three phenotype categories: A, B, and C,
and two samples of each phenotype, so six samples, Sample1 through Sample6.
Suppose Sample2 and Sample4 are phenotype A,
Sample1 and Sample6 are phenotype B,
and Sample3 and Sample5 are phenotype C.

If your GCT file has columns ordered
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6

meaning that the order of the phenotypes is
PhenB PhenA PhenC PhenA PhenC PhenB

then your CLS file would look something like
6 3 1
# B A C
B A C A C B

The third line can alternatively be represented by integer values, in which case your CLS file would look something like
6 3 1
# B A C
0 1 2 1 2 0

where B=0, A=1, C=2,
because B, A, C is the order in which the phenotypes first appear in the dataset file, and the integer representation is zero-indexed.


The first line defines: number of samples, number of phenotypes, 1.
The second line defines the order in which each phenotype first appears in the dataset.
The third line defines the phenotype of each sample, in the order that the samples appear in the dataset.
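
The three-line layout described above can be generated from a list of per-sample phenotypes rather than typed by hand. The following is only a minimal sketch, not part of GSEA: the function name build_cls and its numeric flag are hypothetical, and it assumes the phenotype list is already in the same order as the sample columns of the dataset file.

```python
def build_cls(sample_phenotypes, numeric=False):
    """Build CLS text from phenotype labels listed in dataset column order.

    Hypothetical helper for illustration; not a GSEA function.
    """
    # Phenotype names in order of first appearance
    # (dict keys preserve insertion order).
    class_names = list(dict.fromkeys(sample_phenotypes))

    # Line 1: number of samples, number of phenotypes, 1
    header = f"{len(sample_phenotypes)} {len(class_names)} 1"

    # Line 2: '#' followed by the phenotype names
    names_line = "# " + " ".join(class_names)

    # Line 3: one label per sample, as names or zero-indexed integers
    if numeric:
        index = {name: i for i, name in enumerate(class_names)}
        labels = [str(index[p]) for p in sample_phenotypes]
    else:
        labels = list(sample_phenotypes)

    return "\n".join([header, names_line, " ".join(labels)]) + "\n"


# Phenotypes of Sample1..Sample6 from the example above
phenotypes = ["B", "A", "C", "A", "C", "B"]
cls_text = build_cls(phenotypes)
cls_numeric = build_cls(phenotypes, numeric=True)
print(cls_text)
print(cls_numeric)
```

Deriving the second line from the first appearance of each phenotype keeps the name order and the integer codes consistent with each other by construction.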

If your CLS file is properly structured and matches the dataset file, then you should be able to pick any given set of comparisons without any mislabeling issues.
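
Because the CLS file refers to samples only by position, never by name, a mismatch with the dataset file does not necessarily raise an error. A quick consistency check before running the analysis might look like the sketch below. The helpers parse_cls and check_cls are hypothetical names, the rules checked are only those described in this thread, and the list of sample names is assumed to have been read from the dataset file's header in column order.

```python
def parse_cls(cls_text):
    """Split CLS text into (n_samples, n_classes, class_names, labels)."""
    lines = [ln.strip() for ln in cls_text.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, found {len(lines)}")
    counts = lines[0].split()
    if len(counts) != 3:
        raise ValueError("first line must hold three numbers")
    n_samples, n_classes = int(counts[0]), int(counts[1])
    if not lines[1].startswith("#"):
        raise ValueError("second line must start with '#'")
    class_names = lines[1][1:].split()
    labels = lines[2].split()
    return n_samples, n_classes, class_names, labels


def check_cls(cls_text, sample_names):
    """Return a list of problems; an empty list means no mismatch was found."""
    n_samples, n_classes, class_names, labels = parse_cls(cls_text)
    problems = []
    if n_samples != len(sample_names):
        problems.append(
            f"line 1 declares {n_samples} samples, dataset has {len(sample_names)}"
        )
    if len(labels) != len(sample_names):
        problems.append(
            f"line 3 has {len(labels)} labels, dataset has {len(sample_names)} samples"
        )
    if n_classes != len(class_names):
        problems.append(
            f"line 1 declares {n_classes} phenotypes, line 2 names {len(class_names)}"
        )
    # Line 2 should list phenotypes in the order they first appear on line 3.
    first_seen = list(dict.fromkeys(labels))
    if all(lb.isdigit() for lb in labels):
        expected = [str(i) for i in range(len(class_names))]
    else:
        expected = class_names
    if first_seen != expected:
        problems.append("line 2 order does not match first appearance on line 3")
    return problems


samples = [f"Sample{i}" for i in range(1, 7)]
example_cls = "6 3 1\n# B A C\nB A C A C B\n"
print(check_cls(example_cls, samples))      # no problems for the example above
print(check_cls(example_cls, samples[:4]))  # dataset with two samples removed
```

A check like this only catches structural mismatches. It cannot tell whether a given label is the right phenotype for a given sample, so the column order still needs to be compared against the experimental design by eye.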

As a note, the mislabeled heatmap indicates that the samples in the underlying dataset were incorrectly categorized, which caused an incorrect differential expression computation to be performed. The analysis as it was run should not be used.

Let me know if you have any additional questions, or you continue to see errors in the phenotype assignments in the dataset. We'd be happy to take a look and try to correct the issue directly.
If you'd like us to do so, you can confidentially send your files to gsea...@broadinstitute.org and we can take a look (although please let us know here if you do this so that we can be sure the files come through).

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Soledad Soverina

Jul 23, 2025, 4:13:23 AM
to gsea...@googlegroups.com
Hi Anthony,

Thanks for your reply. I tried to fix the problem, but I wasn't successful.
I just sent my files to gsea...@broadinstitute.org. Please let me know if they came through.
Also, the link you provided shows an error message, and the same thing happens when I try to navigate to other parts of the documentation.

[Attached image: image.png]

Best,

Maria


Message has been deleted

Anthony Castanza

Jul 25, 2025, 1:15:07 PM
to gsea-help
Hi Maria,

Glad we were able to solve your main issue! With regard to the docs site, it appears that Google Groups broke the link I sent you: it converted a # symbol in the URL, which is used to anchor the page to the correct section, to %23, which isn't interpreted correctly. If you copy and paste the URL instead of clicking the embedded hyperlink, it should work. Alternatively, you can navigate to the documentation site directly from the top nav bar on the gsea-msigdb.org website.
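
For anyone who only has the broken link, the encoding can also be reversed programmatically. This is a small sketch using Python's standard urllib.parse.unquote. The exact form of the broken URL is an assumption here (both # symbols are shown as %23); only the original URL is known from this thread.

```python
from urllib.parse import unquote

# Assumed form of the link after Google Groups percent-encoded the '#' symbols
broken = "https://docs.gsea-msigdb.org/%23GSEA/Data_Formats/%23phenotype-data-formats"

# unquote() turns percent-escapes such as %23 back into their characters
fixed = unquote(broken)
print(fixed)
```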

Please let us know if you run into any other issues!


-Anthony
