Hello,
The issues here most likely stem from a mismatch between the sample phenotype assignments in the CLS file and the sample order in the GCT file.
The CLS file format is extremely finicky and must follow the format specification exactly.
The CLS file must assign a phenotype to every sample in the GCT file, with no omissions, and must list those phenotypes in the same order that the samples appear in the GCT file.
For example, say you have three phenotype categories, A, B, and C,
and two samples of each phenotype, so six samples, Sample1 through Sample6.
If Sample2 and Sample4 are phenotype A,
Sample1 and Sample6 are phenotype B,
and Sample3 and Sample5 are phenotype C,
and your GCT file has columns ordered
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
meaning that the order of the phenotypes is
PhenB PhenA PhenC PhenA PhenC PhenB
then your CLS file would look something like
6 3 1
# B A C
B A C A C B
The third line can alternatively be written with integer values, in which case your CLS file would look something like
6 3 1
# B A C
0 1 2 1 2 0
where B=0, A=1, C=2,
because B, A, C is the order in which the phenotypes first appear in the dataset file, and the integer representation is zero-indexed.
To summarize: the first line gives the number of samples, the number of phenotypes, and then 1.
The second line lists the phenotypes in the order in which each first appears in the dataset.
The third line gives the phenotype of each sample, in the order that the samples appear in the dataset.
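
If it helps, the three lines can be generated rather than typed by hand. The following is only a rough Python sketch, not part of GSEA: the function name build_cls and the phenotype_of dictionary are made up for illustration, and it assumes you already have the sample names in GCT column order.

```python
def build_cls(sample_order, phenotype_of):
    """Return CLS text for samples given in GCT column order.

    sample_order: sample names, left to right, as in the GCT header.
    phenotype_of: hypothetical dict mapping sample name -> phenotype label.
    """
    # One label per sample, in dataset order; raises KeyError if a sample
    # has no phenotype, so omissions are caught early.
    labels = [phenotype_of[sample] for sample in sample_order]
    # Phenotype names in order of first appearance (dicts keep insertion order).
    names = list(dict.fromkeys(labels))
    line1 = f"{len(labels)} {len(names)} 1"
    line2 = "# " + " ".join(names)
    line3 = " ".join(labels)
    return "\n".join([line1, line2, line3]) + "\n"


samples = ["Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Sample6"]
phenotype_of = {
    "Sample2": "A", "Sample4": "A",
    "Sample1": "B", "Sample6": "B",
    "Sample3": "C", "Sample5": "C",
}
cls_text = build_cls(samples, phenotype_of)
print(cls_text, end="")
```

For the six-sample example this prints the same three lines as the first CLS file shown above (6 3 1, then # B A C, then B A C A C B).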
If your CLS file is properly structured and matches the dataset file, then you should be able to pick any given set of comparisons without any mislabeling issues.
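
A quick way to catch the most common structural problems before running GSEA is to compare the counts in the CLS file against the GCT sample columns. The sketch below is an illustration only (check_cls is a hypothetical helper, not a GSEA tool), and it assumes you pass in the CLS contents as text and the GCT sample names as a list. Note that it can only detect count mismatches; it cannot tell whether the labels are in the right order, which still has to be verified against your sample annotations.

```python
def check_cls(cls_text, gct_sample_names):
    """Return a list of structural problems found in a CLS file (empty if none)."""
    lines = [ln.strip() for ln in cls_text.splitlines() if ln.strip()]
    if len(lines) != 3:
        return [f"expected 3 non-empty lines, found {len(lines)}"]

    header = lines[0].split()
    if len(header) != 3 or not all(tok.isdigit() for tok in header):
        return ["first line must be three integers: samples, phenotypes, 1"]
    n_samples, n_classes = int(header[0]), int(header[1])

    problems = []
    if header[2] != "1":
        problems.append("third value on the first line should be 1")
    if not lines[1].startswith("#"):
        problems.append("second line should start with '#'")
    names = lines[1].lstrip("#").split()
    labels = lines[2].split()

    if n_samples != len(gct_sample_names):
        problems.append(
            f"header declares {n_samples} samples but the GCT has {len(gct_sample_names)}"
        )
    if len(labels) != n_samples:
        problems.append(
            f"third line has {len(labels)} labels but the header declares {n_samples} samples"
        )
    if len(names) != n_classes:
        problems.append(
            f"second line names {len(names)} phenotypes but the header declares {n_classes}"
        )
    if len(set(labels)) != n_classes:
        problems.append(
            f"third line uses {len(set(labels))} distinct labels but the header declares {n_classes}"
        )
    return problems


gct_samples = ["Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Sample6"]
good_cls = "6 3 1\n# B A C\nB A C A C B\n"
bad_cls = "6 3 1\n# B A C\nB A C A C\n"  # last sample's label omitted

print(check_cls(good_cls, gct_samples))
print(check_cls(bad_cls, gct_samples))
```

The first call should report no problems, and the second should flag the missing label on the third line.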
As a note here, a mislabeled heatmap indicates that the samples were incorrectly categorized in the underlying dataset, which means the differential expression computation was performed on the wrong groupings. The analysis as it was run should not be used.
Let me know if you have any additional questions, or if you continue to see errors in the phenotype assignments in the dataset. We'd be happy to take a look and try to correct the issue directly.
If you'd like us to do so, you can confidentially send your files to
gsea...@broadinstitute.org and we can take a look (although please let us know here if you do this so that we can be sure the files come through).
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego