Collapsed dataset empty when used with ...

Joseph Lai

Jan 1, 2024, 4:33:42 AM
to gsea-help

Hi, 
I see that multiple users have had this issue, but none of the previous posts I found were helpful.
I'm using a custom gene set database to run GSEA on my RNA-seq data.
I formatted my RNA-seq data (4 samples in total, 2 conditions),
and the expression dataset file looks like this:
symbol R-1    R-2   D-1   D-2
CYP4F2 1755.574557 2509.340295 0 0
MIR7-3HG 3106.928985 4578.954719 0 1.060224549
CGA 1054.804671 2597.500494 0.978846375 0
CALN1 444.368404 673.0278597 0 0

For the gene set database, I made my own .gmt file. It looks like this:
1    EGFR
2    YAP1
....
My CLS file looks like this:
4 2 1
# R D
0 0 1 1

No matter which chip platform I choose, I always get the error message
"The collapsed dataset was empty when used with chip:..."

My samples are human cell lines, sequenced (RNA-seq) on an Illumina HiSeq, then analyzed and processed; I uploaded the normalized counts. The gene names in my list are standard official gene symbols. Should I convert them to Ensembl IDs? Or is there another obvious cause for this error?

Thanks for all the help!
Best 
J




Anthony Castanza

Jan 1, 2024, 2:28:41 PM
to gsea...@googlegroups.com
Hi Joseph,

If the gene symbols in your input data files (both expression dataset, and gene set database) already match (e.g. are from the same symbol source), then you may not need to run collapse at all.
That said, since you're getting this error here, there is probably something wrong with the data structure. Instead of including copy-pasted snippets, could you open the files in a plain text editor and provide screenshots of both the gene set database and the expression dataset?
You'll want to ensure that all your input files comply with the file formats specified in our documentation as strictly as possible: https://docs.gsea-msigdb.org/#GSEA/Data_Formats/
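
As a rough illustration of what checking the files against those formats could look like, here is a minimal Python sketch. It assumes the tab-delimited TXT expression layout (a name column, then a DESCRIPTION column, then one column per sample) and the GMT layout (set name, description, then gene symbols, all tab-separated) as I understand them from the Data Formats page. The function names are hypothetical helpers, not part of GSEA.

```python
def check_expression_txt(lines):
    """Return a list of problems found in a tab-delimited TXT expression dataset."""
    problems = []
    header = lines[0].rstrip("\n").split("\t")
    if len(header) < 3:
        # A single field usually means spaces were used instead of tabs.
        problems.append("header has fewer than 3 tab-separated fields")
    elif header[1].strip().lower() != "description":
        # Assumption: the second column is expected to be labelled DESCRIPTION.
        problems.append(f"second column is {header[1]!r}, expected DESCRIPTION")
    for n, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            problems.append(f"line {n}: {len(fields)} fields, header has {len(header)}")
    return problems


def parse_gmt(lines):
    """Parse GMT lines into ({set_name: genes}, problems)."""
    gene_sets, problems = {}, []
    for n, line in enumerate(lines, start=1):
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        if len(fields) < 3:
            problems.append(f"line {n}: needs a name, a description, and at least one gene")
            continue
        gene_sets[fields[0]] = {g for g in fields[2:] if g}
    return gene_sets, problems


def count_overlap(expression_lines, gene_sets):
    """Count how many genes of each set appear in the first column of the dataset."""
    genes = {line.split("\t")[0].strip() for line in expression_lines[1:]}
    return {name: len(members & genes) for name, members in gene_sets.items()}
```

To use it on real files, pass the result of `open(path).read().splitlines()` for each file. An overlap of zero for every set would suggest that the identifiers in the two files do not match; a GMT line with only two fields would be read as a set name and a description with no genes.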

I'll also note that for a 2 vs. 2 comparison, you won't be able to use the default GSEA metric for ranking genes, as this requires a minimum of three samples per class to compute the standard deviation. You might, instead, use log2_ratio_of_classes, or supply an externally calculated differential expression list to the GSEAPreranked mode.
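
For the preranked route, a ranked list can be built outside GSEA. The sketch below is one possible way to do that in Python, scoring each gene by the log2 ratio of class means and writing a two-column, tab-delimited RNK-style list. The pseudocount is an assumption added here so that class means of zero (as in the example rows above) do not cause a division by zero; it is not part of GSEA's own metric, and the function names are hypothetical. A proper differential expression tool would generally be a better source of ranking scores.

```python
import math


def log2_ratio_of_classes(values, labels, pseudocount=1.0):
    """log2(mean of class 0 / mean of class 1); pseudocount is an added assumption."""
    class0 = [v for v, lab in zip(values, labels) if lab == 0]
    class1 = [v for v, lab in zip(values, labels) if lab == 1]
    mean0 = sum(class0) / len(class0) + pseudocount
    mean1 = sum(class1) / len(class1) + pseudocount
    return math.log2(mean0 / mean1)


def rank_genes(rows, labels, pseudocount=1.0):
    """rows: iterable of (gene, [values]); returns (gene, score) sorted high to low."""
    scored = [(gene, log2_ratio_of_classes(vals, labels, pseudocount))
              for gene, vals in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def to_rnk(ranked):
    """Format ranked (gene, score) pairs as tab-delimited text, one gene per line."""
    return "".join(f"{gene}\t{score:.6f}\n" for gene, score in ranked)


# Class labels in sample order, matching the third line of the CLS file (0 0 1 1).
labels = [0, 0, 1, 1]
rows = [
    ("CYP4F2", [1755.574557, 2509.340295, 0.0, 0.0]),
    ("CALN1", [444.368404, 673.0278597, 0.0, 0.0]),
]
rnk_text = to_rnk(rank_genes(rows, labels))
```

Writing `rnk_text` to a file with a `.rnk` extension gives something that can be loaded alongside the gene set database in GSEAPreranked.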

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego
