Regarding RNA SEQ data as RPKM values

Gautam Nisha

Mar 1, 2022, 4:32:54 PM
to gsea-help
Hi

1. I have received RNA-seq data from a WT and KO animal model in the form of RPKM values. Is RPKM data acceptable as input for GSEA?

2. I am getting good enrichment plots, but the heatmaps don't show any difference color-wise.

I am attaching the heatmap and enrichment plot. Could you please help me interpret them so that I can relate the two?

Thanks
HALLMARK_APICAL_JUNCTION_25.png
enplot_HALLMARK_APICAL_JUNCTION_24.png

Anthony Castanza

Mar 1, 2022, 4:42:52 PM
to gsea...@googlegroups.com

Hello,

 

RPKM is not generally considered to be appropriate for between-sample comparisons. There is a pretty good review article on the topic available here: https://rnajournal.cshlp.org/content/26/8/903.full.pdf
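
To make the between-sample concern a little more concrete, here is a minimal Python sketch using made-up counts and gene lengths (every number is hypothetical). It uses the usual RPKM definition, reads × 10^9 / (gene length in bp × total reads in the sample), and shows that the per-sample RPKM totals are not constant, so the same RPKM value does not correspond to the same share of the library in two different samples.

```python
import numpy as np

def rpkm(counts, lengths_bp):
    """RPKM per gene and sample: reads * 1e9 / (gene length in bp * total reads in sample)."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    # counts is genes x samples; the column sums are the per-sample read totals.
    return counts * 1e9 / (lengths_bp[:, None] * counts.sum(axis=0))

# Hypothetical data: 3 genes x 2 samples.
lengths = np.array([1000, 2000, 4000])
counts = np.array([
    [100, 100],   # gene A
    [200, 200],   # gene B
    [100, 2100],  # gene C, much higher in sample 2
])

values = rpkm(counts, lengths)
totals = values.sum(axis=0)  # differs between the two samples
```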

What I would recommend is to go back to whoever provided the RNA-seq dataset and request either the raw counts or the counts normalized by some appropriate method such as DESeq2's median-of-ratios, or TMM or similar. If they provide the raw counts, most implementations of DESeq2 through publicly accessible tools like those on GenePattern.org or usegalaxy.org provide the option to produce the "normalized counts" output. That normalized counts output would be appropriate for GSEA. Additionally, since you only have three samples per group, you would need to ensure that you're running GSEA in "gene set" permutation mode, rather than the default "phenotype" permutation mode.

 

As for the relative lack of difference in the heatmaps associated with the enrichment, it's difficult to say what is causing this without proper normalization. It could be that GSEA is detecting a relatively weak signal, which is something it is fairly well optimized for. That large red block in the middle is a group of genes that were not expressed in any of your samples. Once you've renormalized the data, you might try filtering out such non-expressed genes (some tools, like GenePattern's DESeq2, will do this by default). Under some circumstances that can also improve GSEA results.
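
As a rough illustration of the filtering step, the following Python sketch drops genes whose counts sum to less than a threshold across all samples. The function name and the `min_total=1` default are assumptions chosen for this example; this is not a description of what GenePattern's DESeq2 module does internally.

```python
import numpy as np

def drop_unexpressed(gene_ids, counts, min_total=1):
    """Keep genes whose summed count across all samples is at least min_total."""
    counts = np.asarray(counts)
    keep = counts.sum(axis=1) >= min_total  # one boolean per gene (row)
    kept_ids = [gene for gene, flag in zip(gene_ids, keep) if flag]
    return kept_ids, counts[keep]

# Hypothetical example: gene "A" has no reads in any sample and is removed.
ids, filtered = drop_unexpressed(["A", "B", "C"], [[0, 0], [1, 0], [5, 3]])
```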

 

Hope this helps, let us know if you have any more questions

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

http://gsea-msigdb.org/

Gautam Nisha

Mar 1, 2022, 4:49:40 PM
to gsea-help
Okay, thanks for the info. They provided me with values named QUANT ALL, and they used the QUANT ALL values for the DESeq2 analysis. Should I use the QUANT ALL data?

Thanks

Anthony Castanza

Mar 1, 2022, 4:53:46 PM
to gsea...@googlegroups.com

Hi,

 

If they used the "quant all" information for DESeq2, it was probably raw counts, in which case it would need to be normalized prior to GSEA. If you use the implementation of DESeq2 on GenePattern (cloud.genepattern.org, which is free and only requires a simple registration), one of the outputs will be a normalized counts .gct file that can be used for GSEA. Unfortunately, I don't know exactly what "quant all" is; this isn't a standard name, so I'm only guessing here. My guess is probably correct, but I would still recommend reaching out to the individual or group that provided this dataset and asking specifically what procedures were performed to generate that file.

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

 

Gautam Nisha

Mar 1, 2022, 5:06:16 PM
to gsea-help
Dear Anthony

Can you please explain this text in more detail and in a simpler way: "if you use the implementation of DESeq2 on GenePattern (cloud.genepattern.org, which is free and only requires a simple registration) one of the outputs will be a normalized counts .gct file that can be used for GSEA"?

Thanks!

Anthony Castanza

Mar 1, 2022, 5:12:28 PM
to gsea...@googlegroups.com

If the quant all file contains raw counts, which is what I suspect since the provider used it to run DESeq2 and DESeq2 requires raw counts as input, those counts will need to be normalized to be usable for GSEA. You can either go back to the provider of the data and ask them to export DESeq2's internal normalization table, or you can use the raw counts you've (probably) already been provided to rerun this normalization yourself.
I can provide you with instructions for producing this normalization manually, or you can use one of the free, publicly accessible online bioinformatics platforms to do it. The two major options are Galaxy (usegalaxy.org) and GenePattern (cloud.genepattern.org). With GenePattern, which requires creating a free user account, you'd format the raw counts as a GCT file just as you would for GSEA, but you'd then search for and run the "DESeq2" tool. That tool will give you a series of outputs. One of them will be a file with "normalized.counts.gct" in its name; that file can then be used as the input for GSEA.
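
For readers curious about what the normalization amounts to, here is a minimal NumPy sketch of the median-of-ratios idea: each sample gets a size factor equal to the median, over genes, of its count divided by that gene's geometric mean across samples. This is an illustration under simplifying assumptions (the function names are made up, and genes with a zero count in any sample are simply left out of the reference); it is not DESeq2's code, and for real analyses the DESeq2 output itself should be used.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """One size factor per sample from a genes x samples raw count matrix."""
    counts = np.asarray(counts, dtype=float)
    usable = (counts > 0).all(axis=1)  # genes with no zero counts
    if not usable.any():
        raise ValueError("no gene has nonzero counts in every sample")
    log_counts = np.log(counts[usable])
    # Log of each gene's geometric mean across samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median log ratio per sample, converted back to the linear scale.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def normalize_counts(counts):
    """Divide every sample (column) by its size factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / median_of_ratios_size_factors(counts)

# Hypothetical example: sample 2 was sequenced twice as deeply as sample 1.
raw = [[10, 20], [100, 200], [5, 10], [0, 3]]
size_factors = median_of_ratios_size_factors(raw)
normalized = normalize_counts(raw)
```

In this toy example the first three genes end up with equal normalized values in both samples, which is the intended effect of removing the depth difference.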

 

Hopefully that is clearer, let me know if you have additional questions.

Gautam Nisha

Mar 1, 2022, 5:49:58 PM
to gsea-help
Yes Anthony!
They provided me with raw reads with low coverage and average coverage. I think that is what they used for DESeq2, and the quant all file is what they obtained after DESeq2. Is that what you mean?

Gautam Nisha

Mar 1, 2022, 5:56:34 PM
to gsea-help
Dear Anthony

I talked with the RNA-seq data provider.

The quant all data is normalized data based on the mean of all the samples. Is this suitable for GSEA?
Please confirm.
Thanks

Anthony Castanza

Mar 1, 2022, 6:06:45 PM
to gsea...@googlegroups.com

They might be describing the TMM method of count normalization. TMM is a suitable normalization method for GSEA.

You might ask specifically whether that is what they are describing. If they mean a simple normalization using a global mean without consideration of sample-specific factors, I'm not sure that is appropriate; it is not a method I've tried or seen any literature on.

Some more information on standard normalization methods here: https://academic.oup.com/bib/article/19/5/776/3056951

Gautam Nisha

Mar 2, 2022, 1:36:43 PM
to gsea-help
Dear Anthony

Thank you for sharing this great information.
Can you please provide a detailed workflow for getting normalized data with the DESeq2 tool on GenePattern?

Thanks

Anthony Castanza

Mar 2, 2022, 3:24:14 PM
to gsea-help
Hello,

Getting the normalized counts from GenePattern is simple. Take the raw counts data and format it into the same GCT format you used for your dataset previously, then upload the dataset and the CLS file describing the sample group mappings to the DESeq2 module. After running the module, the ".normalized.counts.gct" file should be one of the default outputs. If you're having difficulty with the GenePattern module, you can send me the associated Job IDs and I can take a look. We work closely with the GenePattern team and they've been gracious enough to share debugging access with us.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Gautam Nisha

Mar 4, 2022, 12:28:19 AM
to gsea...@googlegroups.com
Dear Anthony

Sure I will try similar way. 
Thanks

Gautam Nisha

Mar 4, 2022, 12:23:19 PM
to gsea...@googlegroups.com
Dear Anthony, can you guide me on how to create a GCT file? I created a file by saving from Excel as tab-delimited text, but it does not work when I upload it to DESeq2.


Gautam Nisha

Mar 4, 2022, 1:09:18 PM
to gsea...@googlegroups.com
Dear Anthony!

I created a gct file from the read counts file. But the DESeq2 is not accepting it as input. Please guide me. See picture attached. Thanks




gct format.png

Anthony Castanza

Mar 4, 2022, 1:53:25 PM
to gsea-help
The formatting of this file appears fine. It would need to be saved out of Excel as tab-delimited text and then have its file extension changed from .txt to .gct. Some operating systems hide file extensions by default, so if you don't see a .txt extension on the saved file you'd first need to enable showing of known file extensions; sometimes they can also be seen and edited from the "Properties" (Windows) or "Get Info" (macOS) menus.
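
An alternative to renaming an Excel export is to write the GCT text directly. The sketch below is a minimal Python example of the GCT 1.2 layout (a "#1.2" version line, a line with the row and column counts, a header starting with NAME and Description, then one row per gene). The function name, the "na" placeholder description, and the sample names are assumptions made for this example.

```python
def to_gct(gene_ids, sample_names, matrix, descriptions=None):
    """Return GCT 1.2 text for a genes x samples matrix."""
    if descriptions is None:
        descriptions = ["na"] * len(gene_ids)  # placeholder description
    if len(matrix) != len(gene_ids) or len(descriptions) != len(gene_ids):
        raise ValueError("gene_ids, descriptions and matrix rows must match")
    lines = [
        "#1.2",
        f"{len(gene_ids)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description", *sample_names]),
    ]
    for gene, desc, row in zip(gene_ids, descriptions, matrix):
        if len(row) != len(sample_names):
            raise ValueError(f"row for {gene} has the wrong number of values")
        lines.append("\t".join([gene, desc, *map(str, row)]))
    return "\n".join(lines) + "\n"

# Hypothetical two-gene, two-sample table.
gct_text = to_gct(["Actb", "Gapdh"], ["WT1", "KO1"], [[10, 12], [7, 0]])
```

Writing `gct_text` to a file opened as, say, `open("counts.gct", "w")` gives the file a .gct extension from the start, which sidesteps the hidden .txt extension problem described above.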

If the saving isn't the issue: since you're using GenePattern for this, note that we work closely with the GenePattern team and they've been gracious enough to give us debugging access on their server. If you send us the GenePattern Job ID number, we can take a look at the job directly if it has run and returned an error message.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Gautam Nisha

Mar 9, 2022, 1:03:48 PM
to gsea...@googlegroups.com
Dear Anthony

I have run the DESeq2 analysis and got the default normalized.counts.gct file. I was looking into the by-product files, such as the upregulated and downregulated files. I observed that the tool has just flipped the same set of genes the other way around, and the set of genes is common to both files. Please clarify this.
Thanks for helping. 


Anthony Castanza

Mar 9, 2022, 1:42:54 PM
to gsea-help
Hi Gautam,

By default the GenePattern DESeq2 tool computes differential expression for the first phenotype defined in your CLS file vs the second phenotype defined in your CLS file. Does that look like what happened here? If so, this would not affect the normalization that was done in the counts, just how the differential expression is reported. 
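
Since the direction of the comparison follows the class order in the CLS file, it can help to generate that file programmatically. Here is a minimal Python sketch of the categorical CLS layout: a first line with the number of samples, the number of classes, and a 1; a second line starting with # that lists the class names; and a third line with one label per sample in the same order as the GCT columns. The function name and the WT/KO labels are assumptions made for this example, and labels are assumed to contain no spaces.

```python
def to_cls(sample_classes):
    """Return categorical CLS text from one class label per sample."""
    # Unique class names in order of first appearance among the samples.
    class_names = list(dict.fromkeys(sample_classes))
    header = f"{len(sample_classes)} {len(class_names)} 1"
    return "\n".join([
        header,
        "# " + " ".join(class_names),
        " ".join(sample_classes),
    ]) + "\n"

# Hypothetical design: three WT samples followed by three KO samples.
cls_text = to_cls(["WT", "WT", "WT", "KO", "KO", "KO"])
```

With this sample order, WT is the first class listed and KO the second.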

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Gautam Nisha

Mar 16, 2022, 2:14:39 PM
to gsea...@googlegroups.com
Hi Anthony

I tried running GSEA with the hallmark gene sets and found different genes differentially expressed between the two groups in different gene sets. Is there a way to draw one cluster of differentially expressed genes from these results and present it in a single heatmap?
Thanks 
Nisha Gautam

Anthony Castanza

Mar 16, 2022, 3:05:14 PM
to gsea...@googlegroups.com

Hello,

 

We don't really have anything in the way of support for this feature. I might suggest taking a look at the EnrichmentMap Cytoscape package. If you use the load datasets functionality from within Cytoscape itself rather than from the tool link-out in GSEA, it has the option to load multiple datasets and display overlap between them. https://enrichmentmap.readthedocs.io/en/latest/

Gautam Nisha

Mar 16, 2022, 3:13:08 PM
to gsea...@googlegroups.com
Dear Anthony

Thanks, I will try that way too. I have another question. I ran GSEA for the hallmark gene sets and got different heatmaps corresponding to each gene set. But I also got a heatmap consisting only of significantly differential genes, named heat map 1. Any idea what that heatmap means exactly? I will attach a picture here. Is it the set of differentially expressed genes from all the heatmaps generated for the individual hallmark gene sets? If so, can it be used directly for further analysis such as leading edge or pathway plots?


Anthony Castanza

Mar 16, 2022, 3:16:43 PM
to gsea...@googlegroups.com

This is the heatmap of just the top and bottom 50 genes as ranked by the metric that was used to run GSEA (i.e., the signal_to_noise metric). It's mostly just included to provide a sanity check showing that there is good class discrimination in the top and bottom ranked genes.
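
To show roughly how such a top/bottom list comes about, here is a minimal NumPy sketch that scores each gene with the basic signal-to-noise formula, (mean of class A − mean of class B) / (standard deviation of A + standard deviation of B), and then takes the n highest and n lowest scoring genes. The function names are made up for this example. The sketch assumes sample standard deviations (ddof=1), at least two samples per class, and nonzero variance; any adjustments GSEA itself applies to the standard deviations are not reproduced here, so the scores will not necessarily match GSEA's output.

```python
import numpy as np

def signal_to_noise(matrix, is_class_a):
    """Basic signal-to-noise score per gene for a genes x samples matrix."""
    matrix = np.asarray(matrix, dtype=float)
    is_class_a = np.asarray(is_class_a, dtype=bool)
    a, b = matrix[:, is_class_a], matrix[:, ~is_class_a]
    noise = a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    return (a.mean(axis=1) - b.mean(axis=1)) / noise

def top_and_bottom(gene_ids, scores, n=50):
    """Return the n highest- and n lowest-scoring genes."""
    order = np.argsort(scores)[::-1]  # indices from highest to lowest score
    ranked = [gene_ids[i] for i in order]
    return ranked[:n], ranked[-n:]

# Hypothetical data: two samples in class A, then two in class B.
expr = [[10, 11, 1, 2], [1, 2, 10, 11], [5, 6, 5, 7]]
scores = signal_to_noise(expr, [True, True, False, False])
top, bottom = top_and_bottom(["g1", "g2", "g3"], scores, n=1)
```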

Gautam Nisha

Mar 17, 2022, 1:44:28 PM
to gsea...@googlegroups.com

Dear Anthony
Hi


I have also run another collection, C2 all. I always get a heatmap showing the same differentially expressed genes as with the hallmark gene sets and the C1 positional gene sets. Can you please clarify why I am getting this? Does it signify genes that are common across the gene sets?
Thanks 

Anthony Castanza

Mar 17, 2022, 1:52:34 PM
to gsea...@googlegroups.com

What heat map are you talking about? Could you send a screenshot of what you're referring to? You should only be getting results for gene sets that you've selected.

 

If you're talking about the same heatmap as before, that heatmap has nothing to do with the sets that were run; it shows only the top and bottom ranked genes in the data.

Gautam Nisha

Apr 5, 2022, 4:30:58 PM
to gsea...@googlegroups.com
Dear Anthony

Thank you for helping with GSEA for so long. I ran GSEA successfully and got very interesting insights from the analysis. I was wondering if I could manually create gene sets to run against the experimental dataset. Can you please guide me on how to create gene sets of my own interest?
Thanks in advance for your help.

Anthony Castanza

Apr 5, 2022, 4:45:46 PM
to gsea...@googlegroups.com

Hello,

 

I'm glad you've found the results from GSEA informative!

 

Running GSEA using custom sets is pretty easy. Assuming you have a number of genes of interest (e.g., a set of differentially expressed genes from an independent study, a clique from WGCNA, or really any group of genes), you can put them into any of our Gene Set Database formats, described here: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Gene_Set_Database_Formats

The GMX and GMT formats are mainly intended for multiple sets, where each set is a column (GMX) or a row (GMT). In either format, the first cell of the row or column is used as the set's name and the second as its description, with the remaining cells used for the set's members.

There is also the "grp" format, which is a simple text format in which the first line starts with a # character followed by the set name, and the remaining rows are the set members.
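
The row-per-set (GMT) and single-set (grp) layouts described above can be sketched in a few lines of Python. The function names, the set name MY_SET, and the gene symbols are all placeholders chosen for this example.

```python
def to_gmt(gene_sets):
    """GMT text: one tab-separated row per set (name, description, genes...).

    gene_sets maps a set name to a (description, list_of_genes) pair.
    """
    rows = []
    for name, (description, genes) in gene_sets.items():
        rows.append("\t".join([name, description, *genes]))
    return "\n".join(rows) + "\n"

def to_grp(name, genes):
    """grp text: a '#' line carrying the set name, then one gene per line."""
    return "\n".join([f"# {name}", *genes]) + "\n"

# Hypothetical single custom set.
gmt_text = to_gmt({"MY_SET": ("custom set from an independent study", ["Actb", "Gapdh"])})
grp_text = to_grp("MY_SET", ["Actb", "Gapdh"])
```

The resulting text would be saved with a .gmt or .grp extension, respectively, before loading it into GSEA.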

 

Once a file in any of these formats is prepared, it can be loaded into GSEA the same way you load your expression data or ranked list. Then, in the Run GSEA/Run GSEAPreranked window, click the […] button next to the "Gene sets database" field, and in the window that pops up, click the "Gene matrix (local gmx/gmt)" tab if you've used one of those formats, or click the ">" until you see the "Gene sets (grp)" tab if you used the grp format for your set.

 

One thing to note is that if you run only a single set with GSEA, the "FDR q-val" statistic isn't generally meaningful. The "NOM p-val" will still be valid and will tell you the significance of the enrichment.

 

Let me know if you have additional questions, or encounter any errors during this process

Gautam Nisha

Apr 5, 2022, 5:16:45 PM
to gsea...@googlegroups.com
Dear Anthony 

Sure, I will try it this way, and I may need your guidance again.

Thanks 🙏 

Gautam Nisha

Apr 6, 2022, 4:29:52 PM
to gsea...@googlegroups.com
Dear Anthony


I have a query regarding the GMT format. In the example, the first column (gene set name) shows the cytogenetic location of a gene. How can this be a gene set name? Kindly clarify. Also, can you please explain why there are several columns, such as C, D, E, F, and G, with gene names? I was wondering how to put my gene list in this format. I have a gene list, but I am not sure what should go in column 1 as the gene set name.

Thanks in advance. 

Anthony Castanza

Apr 6, 2022, 4:36:08 PM
to gsea...@googlegroups.com

Hello,

 

In the documentation cytogenetic location is used as an example of a (series of) gene sets. The columns in that example are being used to provide multiple gene sets.

The gene set name you use is entirely up to you. Ideally it should be something meaningful as to the origin of the set you're creating and have minimal special characters (e.g., no slashes, colons, dashes, etc.).

 

If you only have one gene set, you only need one column (this is the column-per-set GMX layout). The name in the first row of that column should be something you choose for that set; likewise, the second row in that column should be a description, which can be longer than the set name and include more details. The rest of the rows in that column are the genes that you want to comprise that set (one per row).
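
As a minimal sketch of the column-per-set layout and the naming advice, the Python below builds GMX text (set names in row 1, descriptions in row 2, genes underneath, with shorter columns padded by empty cells) and replaces special characters in set names with underscores. The helper names and the exact sanitizing rule are assumptions made for this example, not a GSEA requirement.

```python
import re
from itertools import zip_longest

def safe_set_name(name):
    """Replace runs of anything other than letters, digits, or '_' with '_'."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")

def to_gmx(gene_sets):
    """GMX text from a list of (name, description, genes); one column per set."""
    names = [safe_set_name(name) for name, _, _ in gene_sets]
    descriptions = [description for _, description, _ in gene_sets]
    gene_columns = [genes for _, _, genes in gene_sets]
    rows = [names, descriptions]
    # Pad shorter columns with empty cells so every row has one cell per set.
    rows += [list(row) for row in zip_longest(*gene_columns, fillvalue="")]
    return "\n".join("\t".join(row) for row in rows) + "\n"

# Hypothetical single set: one column is enough.
gmx_text = to_gmx([("KO vs. WT: up-regulated", "genes up in KO", ["Actb", "Gapdh"])])
```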

Gautam Nisha

Apr 6, 2022, 4:38:42 PM
to gsea...@googlegroups.com
Dear Anthony 

Thank you for the quick response. I will try similar way. 


Nisha 




Gautam Nisha

Apr 8, 2022, 11:46:12 AM
to gsea...@googlegroups.com
Dear Anthony

Thank you for your help at each step. I have successfully loaded the pre-ranked gene set file into GSEA, but it shows an error when I run it. I am sending you a screenshot of the error. Can you please help in this regard? It will be very helpful to me.

Thanks in advance.

GSEA error.jpg

Anthony Castanza

Apr 8, 2022, 12:14:46 PM
to gsea...@googlegroups.com

It's a little difficult to tell without seeing the Details view of the error, but based on the window in the background it looks like the run might not have been configured correctly. The "Ranked list" box is blank; you should click that box and select the ranked list you're trying to run from the dropdown. Additionally, in the "Collapse/Remap to gene symbols" box you've selected "Remap_only", but in the "Chip platform" box there is no chip file selected. Either "Collapse/Remap" should be set to "None" (not recommended) or the correct chip for your datatype should be selected for the "Chip platform". If you need assistance in selecting the correct chip file, please send a sample of the gene identifiers from the ranked list file.

 

Let me know if you're still having problems after addressing these issues.

Gautam Nisha

Apr 8, 2022, 1:40:14 PM
to gsea...@googlegroups.com
Dear Anthony

Thank you for the quick help.

I am sending you the detailed error and also sending you the sample gmx file.   

Thanks.

geneset.gmx.txt
detailed error.jpg

Anthony Castanza

Apr 8, 2022, 1:43:53 PM
to gsea...@googlegroups.com

As far as I can tell there isn't anything wrong with the GMX file, this error has to do with the ranked list file. Did you try the steps I suggested in my previous email for correcting the issues with parameter values?

Gautam Nisha

Apr 8, 2022, 2:48:43 PM
to gsea...@googlegroups.com
Dear Anthony

Thank you for approving my gmx file. I tried to select the "Ranked list" from the (...) option, but it did not show any file. For Collapse/Remap to gene symbols, I selected no collapse, and for the chip, if I select local, it does not show any file, just as with Ranked list. I am not able to find where I am making a mistake.

Thanks 



Anthony Castanza

Apr 8, 2022, 2:52:30 PM
to gsea...@googlegroups.com

For running GSEA Preranked, your dataset should be prepared in the .RNK format here: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#RNK:_Ranked_list_file_format_.28.2A.rnk.29

Once prepared in that format, it needs to have the file extension .rnk not .rnk.txt. If the file is not showing up, your operating system may have hidden the .txt extension. You should be able to remove this extra extension from the properties/get info menu. Once that's done, try reloading the file into GSEA and then see if it shows up in that dropdown.
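
For reference, a ranked list is just two tab-separated columns, a gene identifier and its ranking score, and it can be written directly with the right extension. The Python sketch below assumes the scores are already computed; the function names, gene symbols, and score values are hypothetical, and the sorting is done here only for readability.

```python
from pathlib import Path

def to_rnk(scores):
    """RNK text: one 'gene<TAB>score' row per gene, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ordered)

def rnk_path(path):
    """Strip a trailing .txt so 'list.rnk.txt' becomes 'list.rnk'."""
    p = Path(path)
    if p.suffix.lower() == ".txt" and p.stem.lower().endswith(".rnk"):
        return p.with_suffix("")
    return p

# Hypothetical scores for three genes.
rnk_text = to_rnk({"Actb": 0.5, "Gapdh": 2.0, "Tp53": -1.25})
target = rnk_path("list.rnk.txt")
```

Calling `target.write_text(rnk_text)` would then save the list as list.rnk, without the extra .txt extension.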

Gautam Nisha

Apr 8, 2022, 3:02:11 PM
to gsea...@googlegroups.com
Dear Anthony


Do I need to save the gmx file as an rnk file? Basically, I am using three files as input: first, an expression file in tab-delimited txt format (used earlier to run GSEA); second, a gmx file (my own gene sets file); and third, a class labels (cls) file. Kindly guide me on which file I need to change into an rnk file, or whether I need to create a new one.

Thanks


Anthony Castanza

Apr 8, 2022, 3:07:57 PM
to gsea...@googlegroups.com

Ok, I didn't realize this was the same data you'd run before, my answer assumed it was a preranked dataset because the screenshot you sent was of the "Run GSEAPreranked" window.
If this is the same data you'd used before then the issue is that you're trying to run the wrong GSEA function. You've clicked on the "Run GSEAPreranked" tool which is designed for a single column of ranked data in the .rnk format. You'll need to click the standard "Run GSEA" function and your data should show up under the standard "expression dataset" dropdown.
