Hi Priya,
I'm not seeing anything obviously wrong with the way the dataset is constructed. However, the GCT headers don't quite match the spec: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GCT:_Gene_Cluster_Text_file_format_.28.2A.gct.29
You could try changing the "Identifier" and "name" columns to "NAME" and "Description" respectively, just to see if that might be the issue.
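If editing the file by hand is awkward, the header change can be scripted. This is only a rough sketch: it assumes the file follows the usual GCT layout (a "#1.2" version line, then a line of row and column counts, then the column header on the third line), and the function name and sample names are made up for illustration.

```python
def fix_gct_header(lines):
    """Return a copy of the GCT lines with the first two column names set to NAME and Description."""
    fixed = list(lines)
    # Line 1 is the "#1.2" version tag, line 2 holds the row/column counts,
    # and line 3 is the column header we want to correct.
    columns = fixed[2].rstrip("\n").split("\t")
    columns[0] = "NAME"
    columns[1] = "Description"
    fixed[2] = "\t".join(columns) + "\n"
    return fixed


# Tiny made-up dataset, only to show the effect on the header line.
example = [
    "#1.2\n",
    "2\t3\n",
    "Identifier\tname\tctrl_1\tctrl_2\ttreat_1\n",
    "GeneA\tna\t10\t12\t30\n",
    "GeneB\tna\t0\t0\t0\n",
]
fixed = fix_gct_header(example)
print(fixed[2], end="")
```

Only the header line is touched; the data rows and the counts line are passed through unchanged.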
That said, you're going to run into a couple of additional issues with this dataset. With only two samples for one of the phenotypes, you won't be able to use the default Signal2Noise ranking metric, as it requires a minimum of three samples per phenotype. To get around this, you'd have to switch to another metric, like log2_ratio_of_classes. However, I noticed that quite a few genes have zeros for all samples, and those will likely cause divide-by-zero errors with that metric. You'd also need to run in gene_set permutation mode.
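If you do go the log2_ratio_of_classes route, one way to avoid the all-zero genes is to drop them before loading the file. The sketch below is an assumption about how that could be done, not a GSEA feature: it expects the standard GCT layout (version line, counts line, header line, then one row per gene with values from the third column on), and the function name is hypothetical.

```python
def drop_all_zero_genes(lines):
    """Remove genes whose expression is zero in every sample and update the row count."""
    version, counts, header = lines[0], lines[1], lines[2]
    kept = []
    for row in lines[3:]:
        fields = row.rstrip("\n").split("\t")
        # Columns 1 and 2 are NAME and Description; the rest are sample values.
        values = [float(v) for v in fields[2:]]
        if any(v != 0 for v in values):
            kept.append(row)
    # The second line of a GCT file is "<number of genes>\t<number of samples>",
    # so the gene count has to match the rows that remain.
    n_samples = counts.rstrip("\n").split("\t")[1]
    return [version, f"{len(kept)}\t{n_samples}\n", header] + kept


# Made-up data: GeneB is zero in every sample and should be removed.
gct = [
    "#1.2\n",
    "3\t3\n",
    "NAME\tDescription\tctrl_1\tctrl_2\ttreat_1\n",
    "GeneA\tna\t10\t12\t30\n",
    "GeneB\tna\t0\t0\t0\n",
    "GeneC\tna\t0\t5\t0\n",
]
filtered = drop_all_zero_genes(gct)
```

Note that this only removes genes that are zero everywhere. A gene that is zero in all samples of just one phenotype would still be kept, so it may not cover every divide-by-zero case.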
A better option might be to compute differential expression with something like DESeq2 and use the resulting log2 fold change or Wald statistic as the ranked list in GSEA Preranked. This is probably the most straightforward way to handle this particular dataset.
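As a rough illustration of that last step, here is one way to turn a DESeq2 results table into a two-column ranked list (gene, score) for GSEA Preranked. It assumes the results were exported to CSV from R with the gene identifiers in the first column and the Wald statistic in the "stat" column (use "log2FoldChange" instead if you prefer), and that the identifiers already match the ones used in your gene sets. The function name and the example values are invented.

```python
import csv
import io


def deseq2_to_rnk(csv_text, score_column="stat"):
    """Build tab-delimited ranked-list text (gene, score) from DESeq2 results exported as CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # With write.csv in R, the gene identifiers end up in the first (unnamed) column.
    gene_column = reader.fieldnames[0]
    ranked = []
    for row in reader:
        score = row[score_column]
        # DESeq2 reports NA for genes it could not test (e.g. all-zero counts); skip those.
        if score in ("", "NA"):
            continue
        ranked.append((row[gene_column], float(score)))
    # Sort from most positive to most negative score.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Made-up results table for illustration only.
results_csv = (
    '"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"\n'
    '"GeneB",50.0,-2.0,0.5,-4.0,0.00006,0.001\n'
    '"GeneA",100.0,1.5,0.5,3.0,0.0027,0.01\n'
    '"GeneC",0.0,NA,NA,NA,NA,NA\n'
)
rnk_text = deseq2_to_rnk(results_csv)
```

The resulting text can be saved with a .rnk extension and loaded into GSEA as the ranked list.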
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego