Hello,
The more recent answer is the correct one. ssGSEA needs its input corrected for the gene-length bias in the counts, and TPM is the best way to correct for that bias. Because ssGSEA calculates enrichment within an individual sample, the biases within that sample are the most important to account for. The count-based normalization methods do not account for length bias at all.
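To illustrate the length correction, here is a minimal sketch of the usual TPM calculation for a single sample. The function name, gene names, and toy numbers are made up for illustration; in practice TPM normally comes straight from the quantification pipeline.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's raw read counts to TPM.

    counts     -- dict of gene -> read count for a single sample
    lengths_bp -- dict of gene -> gene (or effective transcript) length in bp
    """
    # Reads per kilobase: dividing by length removes the advantage that
    # long genes have in collecting reads.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    # Rescale so the values in this sample sum to one million.
    scale = sum(rpk.values()) / 1e6
    return {g: value / scale for g, value in rpk.items()}


# Toy sample (hypothetical values): geneB has 4x the reads of geneA,
# but only because it is 4x longer.
counts = {"geneA": 100, "geneB": 400, "geneC": 500}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 1000}
tpm = counts_to_tpm(counts, lengths)

# Within-sample ranking, which is what ssGSEA works from.
rank_by_counts = sorted(counts, key=counts.get, reverse=True)
rank_by_tpm = sorted(tpm, key=tpm.get, reverse=True)
```

In the raw counts geneB outranks geneA, while after the TPM conversion the two are tied, so the within-sample gene ranking no longer reflects gene length.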
There are definitely concerns with comparing TPM across samples directly; a good summary of them is here: https://rnajournal.cshlp.org/content/early/2020/04/13/rna.074922.120.full.pdf
However, many of the concerns they mention aren’t TPM-specific; they are general concerns about comparing wildly differing samples.
For data processed through a common pipeline like CCLE, potential technical variance issues are lessened substantially.
For the “sample normalization method” parameter with TPM data, we generally recommend leaving it on the default, “none”. The other options are for specialized use cases.
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego
--
You received this message because you are subscribed to the Google Groups "GenePattern Help Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
genepattern-he...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/genepattern-help/ff966767-2851-4e6d-b983-618da7e824e9n%40googlegroups.com.
Hi Sheila,
I don’t have an easy answer for you. Batch effects are complicated to deal with even in standard gene-level analysis pipelines, and none of the tools are designed to work on TPM data, since the standard pipeline ends at gene-level differential expression or between-phenotype-group enrichment analysis, both of which are count-based.
I might suggest running ssGSEA first, then clustering your ssGSEA results to see if the data is still clustering by batch at the level of gene sets. If it is, then yes, you’ll probably have to perform some batch correction.
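As a rough numeric companion to that clustering check, here is a small sketch that compares how similar samples are within a batch versus between batches at the gene set level. It is a stand-in for a proper clustering plot, not a replacement for one, and the function names, sample names, and scores below are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def batch_similarity(scores, batch):
    """Mean sample-sample correlation within batches and between batches.

    scores -- dict of sample -> list of ssGSEA scores (same gene set order)
    batch  -- dict of sample -> batch label
    Assumes at least one within-batch pair and one between-batch pair.
    """
    within, between = [], []
    for s1, s2 in combinations(scores, 2):
        r = pearson(scores[s1], scores[s2])
        (within if batch[s1] == batch[s2] else between).append(r)
    return mean(within), mean(between)


# Made-up ssGSEA scores for three gene sets in four samples.
scores = {
    "s1": [0.1, 0.5, 0.9],
    "s2": [0.2, 0.5, 0.8],
    "s3": [0.9, 0.5, 0.1],
    "s4": [0.8, 0.6, 0.2],
}
batch = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
within_r, between_r = batch_similarity(scores, batch)
```

If the within-batch correlation is much higher than the between-batch correlation and the batches don’t line up with your biology, that points the same way as samples clustering by batch.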
Mike Love, the author of DESeq2, answered a similar question about batch correcting TPM on the Bioconductor support site (here: https://support.bioconductor.org/p/107760/).
His suggestion is to remove the mean batch shift from log2(TPM) data. I’d suggest actually *un*-logging it at the end for ssGSEA, however (because of the way ssGSEA’s internal math works). I think that’s basically what you’re proposing to do: use `svaseq` (from the `sva` package, which also contains ComBat) to get the batch variables, and then use those with limma’s `removeBatchEffect`.
I can’t tell you in advance whether it’s going to work. The only thing you can do is try it and see if your data clusters reasonably well before running ssGSEA. If it works, you should have removed the batch effects while preserving the length-bias removal of the TPM conversion.
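For a sense of what “remove the mean batch shift, then un-log” looks like, here is a minimal sketch assuming known batch labels. It is plain per-gene mean-centering, not what `removeBatchEffect` or ComBat do internally, and it does not protect any biological covariates. The function name, the pseudocount of 1, and the toy values are assumptions for illustration only.

```python
import math


def remove_batch_mean_shift(tpm, batch, pseudocount=1.0):
    """Per-gene batch mean-centering on log2(TPM + pseudocount), then un-log.

    tpm   -- dict of sample -> dict of gene -> TPM
    batch -- dict of sample -> batch label
    """
    samples = list(tpm)
    genes = list(tpm[samples[0]])
    logged = {s: {g: math.log2(tpm[s][g] + pseudocount) for g in genes}
              for s in samples}
    corrected = {s: {} for s in samples}
    for g in genes:
        grand_mean = sum(logged[s][g] for s in samples) / len(samples)
        by_batch = {}
        for s in samples:
            by_batch.setdefault(batch[s], []).append(logged[s][g])
        batch_mean = {b: sum(v) / len(v) for b, v in by_batch.items()}
        for s in samples:
            # Subtract how far this sample's batch sits from the grand mean.
            shifted = logged[s][g] - (batch_mean[batch[s]] - grand_mean)
            # Un-log back to a TPM-like scale; clip tiny negatives at zero.
            corrected[s][g] = max(2.0 ** shifted - pseudocount, 0.0)
    return corrected


# Toy data (hypothetical): batch B reads systematically higher for g1.
tpm = {
    "s1": {"g1": 3.0}, "s2": {"g1": 7.0},
    "s3": {"g1": 7.0}, "s4": {"g1": 15.0},
}
batch = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
corrected = remove_batch_mean_shift(tpm, batch)
```

After this, the two batches have the same per-gene mean on the log scale, and the output is back on an un-logged scale for ssGSEA. The corrected columns will no longer sum to exactly one million, so it is still worth checking the clustering afterwards as described above.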
I don’t actually remember when we’d recommend using the “sample normalization method” parameter. I believe it is one of the options that was implemented for using microarray data with ssGSEA.
-Anthony
Hi Sheila,
I don’t think we have any specific examples of the combine function. That said, it’s pretty simple.
The combine mode works by looking for gene set pairs with the suffixes _UP and _DN and an otherwise identical gene set name. Typically these are gene sets that come from the upregulated and downregulated sides of a single experiment. If “combine.replace” is set, it will merge the _UP and _DN gene sets and use the single combined set instead of the separate up and down sets. If “combine.add” is set, it will test the merged set in addition to the separate _UP and _DN sets. (“combine.off” tells it not to do any of this.)
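To make the three modes concrete, here is a small sketch of just the name-matching logic described above. It is not the module’s code: the function name is hypothetical, naming the combined set after the shared prefix is an assumption, and how the merged set is actually scored is not shown.

```python
def plan_combined_sets(set_names, mode="combine.add"):
    """Return the gene set names that would appear in the results.

    Pairs are sets named <BASE>_UP and <BASE>_DN with an identical <BASE>.
    The combined set is labeled <BASE> here (an assumption for this sketch).
    """
    if mode not in {"combine.off", "combine.replace", "combine.add"}:
        raise ValueError(f"unknown combine mode: {mode}")
    if mode == "combine.off":
        return list(set_names)

    names = set(set_names)
    # A base name is paired only if both its _UP and _DN versions exist.
    paired = {n[:-3] for n in names
              if n.endswith("_UP") and n[:-3] + "_DN" in names}

    out, emitted = [], set()
    for n in set_names:
        base = n[:-3] if n.endswith(("_UP", "_DN")) else None
        if base in paired:
            if mode == "combine.add":
                out.append(n)          # keep the separate _UP / _DN set
            if base not in emitted:
                out.append(base)       # the merged set, listed once
                emitted.add(base)
        else:
            out.append(n)              # unpaired sets pass through unchanged
    return out


# Hypothetical gene set names: only STUDY_A has both an _UP and a _DN set.
sets = ["STUDY_A_UP", "STUDY_A_DN", "STUDY_B_UP", "HALLMARK_X"]
off = plan_combined_sets(sets, "combine.off")
replaced = plan_combined_sets(sets, "combine.replace")
added = plan_combined_sets(sets, "combine.add")
```

In this sketch STUDY_B_UP has no matching _DN set, so it is left alone in every mode.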
The documentation (https://gsea-msigdb.github.io/ssGSEA-gpmodule/v10/index.html) describes this in a little more detail.
There generally isn’t any harm in running in combine.add mode. If any sets paired in this way are detected, you’ll just get an extra gene set in your results; if there aren’t, it has no effect.