Cannot run GSEA with my RNAseq data


Edward Liu

Aug 23, 2018, 1:13:51 AM
to gsea-help
Hello,

I cannot run GSEA with my RNA-seq data (3 samples, 502 genes).
I have converted my files into the formats accepted by the GSEA program
and successfully loaded 1) the dataset and 2) the phenotype labels.
The gene names in my gene list have all been converted to Ensembl IDs (for example, BRCA2 -> ENSG00000139618).
Chip platform: ENSEMBL_mouse_gene.chip

However, when I tried to run GSEA, the screen showed "After pruning, none of the gene sets passed size thresholds."
I have tried several ways to solve it:
Collapse dataset to symbols: "false"
Max size: exclude larger sets: "500 -> 600"
I also tried several gene sets databases, but the same error window still appears.

Does anyone have any idea?

Thanks.

Best,
Edward

David Eby

Aug 23, 2018, 8:48:46 AM
to gsea-help
Hi Edward,

That's a very small number of genes for a GSEA analysis.  What's likely happening is that all the gene sets are being screened out as not having enough members matching the genes in your dataset.  The threshold of interest here would be the *Min size* threshold.  You could try moving that downward, but that would mean you are effectively operating with very small gene sets containing only a few members each and thus questionable significance.

It's best to use larger datasets for GSEA analysis: at least thousands of genes, or even a whole-genome dataset (tens of thousands).  GSEA does not generally benefit from filtering, except possibly for removing low-count genes in the case of RNA-Seq data.  See our Wiki and User Guide for more information.
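
To see why a 502-gene dataset can leave nothing to test, here is a minimal Python sketch of the size filter described above. It is only an illustration, not GSEA's actual code: the function name `prune_gene_sets` and the toy gene names are hypothetical, and the limits of 15 and 500 are assumed to be the Min size and Max size defaults.

```python
def prune_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=500):
    """Restrict each gene set to genes present in the dataset, then keep
    only the sets whose restricted size lies within [min_size, max_size]."""
    dataset_genes = set(dataset_genes)
    kept = {}
    for name, members in gene_sets.items():
        overlap = set(members) & dataset_genes
        if min_size <= len(overlap) <= max_size:
            kept[name] = sorted(overlap)
    return kept


# Toy data (hypothetical): one 40-gene set, a tiny dataset and a large one.
gene_sets = {"TOY_PATHWAY": [f"GENE{i}" for i in range(40)]}
small_dataset = [f"GENE{i}" for i in range(5)]
full_dataset = [f"GENE{i}" for i in range(1000)]

print(len(prune_gene_sets(gene_sets, small_dataset)))  # 0: only 5 members match
print(len(prune_gene_sets(gene_sets, full_dataset)))   # 1: all 40 members match
```

With the small dataset only 5 of the 40 members match, so the set falls below the minimum and is pruned. With the full dataset the same set passes.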

Regards,
David

Neville, Peggy

Aug 23, 2018, 4:33:31 PM
to gsea...@googlegroups.com

Hi David,

 

This is Peggy Neville.  I, too, am having trouble with my RNA-seq database, which I have attached.  I have shortened it to about 10,000 genes by removing the low expressors (<50) as well as genes where both duplicates are 0.  I keep getting an error message saying essentially that the first line cannot be read:

I also attach the text file.  Thank you for your help.

 

Peggy Neville

GSEA.txt..txt

Edward Liu

Aug 24, 2018, 1:33:11 AM
to gsea-help
Hi David,

Thanks for the quick reply to my question.

Our RNA-seq results were analyzed and filtered for significant fold changes, so only 502 genes remained in the file.
Following your suggestion, I should go back to the full sequencing data (before filtering) and input all of the genes into the GSEA program.
Otherwise, if there are too few genes, I cannot compare the RNA-seq data with the gene sets from the database.

Am I correct?

Best,
Edward


On Thursday, August 23, 2018 at 8:48:46 PM UTC+8, David Eby wrote:

David Eby

Aug 24, 2018, 4:33:31 PM
to gsea-help
Hi Peggy,

There are several issues with the file.  GSEA is particular about the TXT files it accepts; they need to match the rules in the Data Formats page of our Help Wiki.  

It looks like the first error here is because of the "id" column.  Since GSEA didn't see the "DESCRIPTION" header, it interpreted this as a sample column rather than an annotation field.  Renaming that column header to "DESCRIPTION" allows the file to load.

However, there are several columns here that are NOT expression values (logFC, PValue, FDR).  Those should be dropped from the file or GSEA will interpret them as additional samples.
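
As a rough sketch of those two fixes, the Python below renames the annotation column to DESCRIPTION and drops the statistics columns from a tab-delimited table. The function name and the example table are hypothetical (the attached file's exact layout isn't reproduced here); only the column names "id", "logFC", "PValue" and "FDR" come from the discussion above.

```python
import csv
import io

STAT_COLUMNS = {"logFC", "PValue", "FDR"}  # non-expression columns named above


def clean_expression_table(text, annotation_column="id"):
    """Rename the annotation column to DESCRIPTION and drop statistics
    columns from a tab-delimited table supplied as a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]
    keep = [i for i, name in enumerate(header) if name not in STAT_COLUMNS]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row_number, row in enumerate(rows):
        cells = [row[i] for i in keep]
        if row_number == 0:
            cells = ["DESCRIPTION" if c == annotation_column else c for c in cells]
        writer.writerow(cells)
    return out.getvalue()


# Hypothetical two-sample table with the problem columns present.
raw = (
    "NAME\tid\tCtrl\tKO\tlogFC\tPValue\tFDR\n"
    "GENE_A\tx1\t10\t20\t1.0\t0.01\t0.05\n"
)
print(clean_expression_table(raw), end="")
```

Only the NAME, DESCRIPTION and sample columns remain in the result, so nothing is left for GSEA to misread as an extra sample.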

Lastly, to use this dataset with one of our Ensembl CHIP files, it is necessary to trim the version suffix from all of the Ensembl IDs (for example, ENSG00000136155.16 -> ENSG00000136155).  See here for more details.
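
One possible way to trim the suffix, sketched in Python. The helper name `strip_ensembl_version` is hypothetical, and the pattern assumes IDs of the form ENS + letters + digits with an optional ".version" at the end; anything else is returned unchanged.

```python
import re

# "ENS", optional species/feature letters, digits, then ".<version>"
_VERSIONED_ID = re.compile(r"^(ENS[A-Z]*\d+)\.\d+$")


def strip_ensembl_version(identifier):
    """Return the Ensembl ID without its trailing '.<version>', if present."""
    match = _VERSIONED_ID.match(identifier)
    return match.group(1) if match else identifier


print(strip_ensembl_version("ENSG00000136155.16"))  # ENSG00000136155
print(strip_ensembl_version("ENSG00000136155"))     # already clean, unchanged
```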

I hope that helps.

David Eby

Aug 24, 2018, 4:39:57 PM
to gsea-help
Hi Edward,

Yes, that's correct.  Think of this as a separate line of analysis from the logFC comparison rather than a downstream step.

You should still preprocess and normalize as usual, and it's beneficial to filter genes with low-count expression across the dataset.  Details are here.
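
For illustration only, a low-count filter could look like the Python sketch below. The cutoff of 10 total counts, the function name and the toy values are all assumptions made for this example, not a recommendation from the documentation; pick a threshold that suits your own data.

```python
def filter_low_count_genes(counts, min_total=10):
    """Keep genes whose summed count across all samples is at least min_total.

    counts maps a gene name to its list of per-sample counts."""
    return {gene: values for gene, values in counts.items() if sum(values) >= min_total}


# Hypothetical counts for four samples.
counts = {
    "GENE_A": [0, 1, 0, 2],          # total 3: removed
    "GENE_B": [286, 270, 525, 660],  # clearly expressed: kept
}
print(sorted(filter_low_count_genes(counts)))  # ['GENE_B']
```

Note that this removes genes by expression level only. It does not select on fold change or significance, which is the kind of filtering that shrank the dataset to 502 genes.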

Neville, Peggy

Aug 24, 2018, 6:34:37 PM
to gsea...@googlegroups.com
Thank you, David.  I finally found the .txt format instructions and was able to load the file after doing all you suggested.  The program still doesn't like my column names but I will see if I can fix this myself.
I will get back to you if I have more questions.
Peggy


--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/311e1c9f-3a7c-4e74-bcbb-bd0ed7589697%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Neville, Peggy

Aug 25, 2018, 2:06:49 PM
to gsea...@googlegroups.com

Hi Again, David,

 

Peggy Neville.  I succeeded in loading my RNA-SEQ database as a TXT file without errors and I am now on the Run protocol.  I am getting errors which I think are related either to the labels on the data columns in the text file (attached) or to the use of the wrong RNA-Seq platform.  My data are from human cells.  I attach a word file with a copy of the messages associated with the run.  The .cls file generated by the run is also attached.  It contains only data for the TSPAN row.

 

Thanks again for your help.

 

Peggy

 


TSPAN6_profile_in_CLDN4_OVCAR3_for_GSEA.cls
CLDN4 OVCAR3 for GSEA.txt
ERROR on Run Aug 25..docx

David Eby

Aug 27, 2018, 3:02:21 PM
to gsea-help
Hi Peggy,

There's no need for you to use the Collapse Dataset feature since you already have expression data at the gene-level and are working with gene symbols instead of working at the transcript-level or with microarray probes.  That's only needed if you need to "collapse" multiple probe-level values to a single gene-level value, or if you are converting from one gene symbol namespace to another (like Ensembl -> HUGO).  Just set the "collapse" option to false to skip this step.
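
To make "collapsing" concrete, here is a small Python sketch that merges several probe-level rows into one gene-level row. It assumes a "take the maximum per sample" rule purely for illustration; GSEA has its own collapse options, and the function name, probe names and values are hypothetical.

```python
def collapse_to_genes(probe_values, probe_to_gene):
    """Collapse probe-level rows to gene-level rows by taking, for each
    sample, the maximum value among the probes mapped to the same gene."""
    collapsed = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a mapping are dropped
        if gene in collapsed:
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
        else:
            collapsed[gene] = list(values)
    return collapsed


# Two hypothetical probes map to the same gene; a third has no mapping.
probe_values = {"probe_1": [1, 5], "probe_2": [3, 2], "probe_3": [7, 7]}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A"}
print(collapse_to_genes(probe_values, probe_to_gene))  # {'GENE_A': [3, 5]}
```

A dataset that already has one row per gene symbol gains nothing from this step, which is why it can be skipped here.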

However, I notice that your dataset has only two samples.  Running an analysis with so few samples requires changes to a couple of the default settings.  See the Run GSEA Page in our User Guide for more details.  You'll need to change the Permutation Type to "gene_set", as "Phenotype" (the preferred setting) requires at least 7 samples per phenotype.  You'll also need to choose a different metric, as the default Signal2Noise requires at least 3 samples per phenotype.

You should consult your local statistician to choose an appropriate metric for your data, and also to determine whether two samples will be sufficient for a meaningful analysis.
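
The two sample-count rules above can be written down as a quick check. This Python sketch is not part of GSEA; the function name and the returned keys are made up for illustration, and only the thresholds (7 per phenotype for phenotype permutation, 3 per phenotype for Signal2Noise) come from the paragraph above.

```python
from collections import Counter


def suggest_settings(sample_labels):
    """Apply the sample-count rules to a list of per-sample phenotype labels."""
    smallest_group = min(Counter(sample_labels).values())
    return {
        # phenotype permutation needs at least 7 samples per phenotype
        "permutation_type": "phenotype" if smallest_group >= 7 else "gene_set",
        # the default Signal2Noise metric needs at least 3 per phenotype
        "signal2noise_usable": smallest_group >= 3,
    }


print(suggest_settings(["Ctrl", "Ctrl", "KO", "KO"]))
# {'permutation_type': 'gene_set', 'signal2noise_usable': False}
```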

Regards,
David


Neville, Peggy

Aug 28, 2018, 3:54:45 PM
to gsea...@googlegroups.com

Hi Again, David,

 

This may be the hardest program I have ever tried to use.

 

I have parsed my data, which are only duplicates but somehow came back with probabilities.  I was taught that you needed three samples for statistical analysis but the RNA Seq data came back with very well matching duplicates for two conditions—a control culture and culture with a gene knock out.

 

I have now put all four datasets into my .txt file organized like this:

 

Gene Name   Description          Ctrl_1   Ctrl_2   KO_1   KO_2
MARC1       ENSG00000186205.12   1980     1739     401    388
MARCH1      ENSG00000145416.13   31       24       105    123
MARCH2      ENSG00000099785.10   286      270      525    660
MARCH3      ENSG00000173926.5    953      752      368    346

 

The Program accepted this file but still has other files in the cache which I cannot get rid of and which only have the average of the two datasets in each condition.

 

The next problem is the run GSEA.

For parameters I made a file in the .cls format

 

4

2

1

# OVCAR3

Ctrl

KO

Ctrl

Ctrl

KO

KO

 

My goal is to determine what pathways are altered in the knock out samples.  I already know that genes associated with the mTOR pathway are downregulated in the knockout and those associated with the HIPPO pathway are upregulated simply by examining the data.  In addition the ctrl cultures express many genes classified as epithelial which are downregulated in the KO.  Many genes in the KO samples are classified as mesenchymal with low expression of epithelial genes. These are large changes. 

 

My question for the gsea analysis is what other pathways are affected by the loss of the knock-out gene (which happens to be CLDN4, if you are curious).  I am using the h.all v 6.2 symbols.gmt database.  Would a different one be better?

 

Phenotype labels as above.

 

Collapse database:  false

 

Permutation:  gene set

 

Chip platform:  RefSeq_human.chip

 

The error message was “As the phenotype was continuous, only continuous class metrics are allowed.”  Apparently I have designed my .cls file incorrectly.  Should I just go back to the two column format and, if so, what should the .cls file look like?

 

Sorry to trouble you again, but I really would like to make this analysis work.

 

Peggy Neville


David Eby

Aug 29, 2018, 12:40:53 PM
to gsea-help
Hi Peggy,

I'll answer your questions in-line below to keep the context...


On Wednesday, August 29, 2018 at 4:54:45 AM UTC+9, Neville, Peggy wrote:

Hi Again, David,

 

This may be the hardest program I have ever tried to use.


I'm sorry that you feel that way.  We're working with a very old code base and limited resources, so we haven't been able to make some of the usability fixes we would like, particularly around better file formats, improved parsing and error messages, etc.  We're hoping to ramp up development again soon and the topic will certainly come up.

The Program accepted this file but still has other files in the cache which I cannot get rid of and which only have the average of the two datasets in each condition.

Do you mean the Recently Used Files list?  You can clear it with "Files > Clear recent file history" or by right-clicking on any of the files and choosing "Purge all files".
 

The next problem is the run GSEA.

For parameters I made a file in the .cls format

 

4

2

1

# OVCAR3

Ctrl

KO

Ctrl

Ctrl

KO

KO

 

It looks like there may be a formatting error here, though it's hard to know with copy-pasted contents.  The class names should be on the same line as the '#' and there should only be two, based on the 2 in the first line.  I think this may be the source of the error you report later; GSEA is reading this as a Continuous CLS rather than a Categorical CLS.  See our Data Formats guide for details of the CLS format.
 
Also, where does OVCAR3 come into the picture?  Since you only have two phenotypes, the extra class name might also confuse GSEA.

FWIW, GSEA probably needs a simpler format for entering class data.  Many folks are confused by the CLS format.
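
As a sketch of the categorical layout (a counts line, then a '#' line naming the classes, then one label per sample), the Python helper below generates the text for a list of sample labels. The helper name is hypothetical and this is my reading of the format, so please check the result against the Data Formats guide. Labels must not contain spaces and must follow the order of the sample columns in the dataset.

```python
def build_categorical_cls(sample_labels):
    """Build categorical CLS text for samples listed in dataset column order."""
    classes = list(dict.fromkeys(sample_labels))  # unique labels, first-seen order
    header = f"{len(sample_labels)} {len(classes)} 1"
    lines = [header, "# " + " ".join(classes), " ".join(sample_labels)]
    return "\n".join(lines) + "\n"


print(build_categorical_cls(["Ctrl", "Ctrl", "KO", "KO"]), end="")
# 4 2 1
# # Ctrl KO
# Ctrl Ctrl KO KO
```

For four samples in two classes, the file has exactly three lines, and the '#' line names exactly two classes.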
 

My goal is to determine what pathways are altered in the knock out samples.  I already know that genes associated with the mTOR pathway are downregulated in the knockout and those associated with the HIPPO pathway are upregulated simply by examining the data.  In addition the ctrl cultures express many genes classified as epithelial which are downregulated in the KO.  Many genes in the KO samples are classified as mesenchymal with low expression of epithelial genes. These are large changes. 

 

My question for the gsea analysis is what other pathways are affected by the loss of the knock-out gene (which happens to be CLDN4, if you are curious).  I am using the h.all v 6.2 symbols.gmt database.  Would a different one be better? 

 
That's definitely a good starting point.  I'm not the right person to ask about the scientific aspects of experimental design or analysis, but in general we recommend folks start with the Hallmarks collection to get an overview.  That will give you a better idea of the "meta" view of pathway enrichment across MSigDB and may guide you in which other collections or subcollections to investigate for subsequent analyses.

Phenotype labels as above.

 

Collapse database:  false

 

Permutation:  gene set

 

Chip platform:  RefSeq_human.chip 


There's no need to select a Chip platform with Collapse=false.  This will be ignored in that case.
 

The error message was “As the phenotype was continuous, only continuous class metrics are allowed.”  Apparently I have designed my .cls file incorrectly.  Should I just go back to the two column format and, if so, what should the .cls file look like?


See above; GSEA is interpreting it as a continuous CLS.  That's used e.g. for time-series.
 

Sorry to trouble you again, but I really would like to make this analysis work.

 

Peggy Neville 


No trouble.  I hope it helps.

Regards,
David
 


Neville, Peggy

Sep 1, 2018, 5:33:52 PM
to gsea...@googlegroups.com

YES!!  I got it to work.  It is the phenotype parameter that makes things really difficult.

 

May be back to you as I try to look at another dataset.

 

But thank you very much for your help.

 

Peggy Neville

 

