Hi,
By default, an Ensembl ID will be omitted from the CHIP if it does not have both an assigned gene symbol and an assigned NCBI gene ID. Genes missing this information will also never be found in a gene set. However, since these genes can still have an impact on the gene ranking, we do allow these Ensembl IDs to be passed through the collapse operation “as-is”. This option is under “Advanced fields” and is called “Omit features with no symbol match”. By default it is set to “true”, which restricts the expression set to just the genes that fit the Ensembl/HGNC/NCBI consensus criteria; setting it to “false” will include the unmatched Ensembl IDs. We think that omitting these genes is generally fine, as they are generally lower quality constructs (genes with less strong evidence of being “real”, as they are only present in Ensembl’s transcriptome model and not in the NCBI model), but switching this parameter to “false” will give you a technically more accurate depiction of how your enrichment compares to all the constructs that Ensembl says are possible.
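To make the effect of that option concrete, here is a rough Python sketch of what the collapse step does with unmatched IDs. This is only an illustration, not GSEA's actual code: the function name `collapse_ids`, the `omit_unmatched` argument, and the IDs and symbols in the example are all made up, and merging duplicate symbols by per-sample maximum is just one of the available collapse modes.

```python
def collapse_ids(expression, chip, omit_unmatched=True):
    """Collapse Ensembl IDs to gene symbols (hypothetical helper, not GSEA code).

    expression: dict mapping Ensembl ID -> list of per-sample values
    chip:       dict mapping Ensembl ID -> gene symbol; it only contains IDs
                that have both a symbol and an NCBI gene ID
    """
    collapsed = {}
    for ensembl_id, values in expression.items():
        symbol = chip.get(ensembl_id)
        if symbol is None:
            if omit_unmatched:
                continue  # option set to "true": drop the feature
            symbol = ensembl_id  # option set to "false": pass through as-is
        if symbol in collapsed:
            # Several IDs map to one symbol: keep the per-sample maximum.
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed


# Made-up IDs and values, two samples per feature.
expression = {"ENSG_A": [5.0, 7.0], "ENSG_B": [1.0, 2.0], "ENSG_C": [3.0, 9.0]}
chip = {"ENSG_A": "GENE1", "ENSG_C": "GENE1"}  # ENSG_B has no symbol match

print(collapse_ids(expression, chip))                        # ENSG_B dropped
print(collapse_ids(expression, chip, omit_unmatched=False))  # ENSG_B kept
```

With the flag off, the unmatched ID stays in the ranked list under its Ensembl ID, so it still shifts the positions of the other genes even though no gene set can contain it.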
If this doesn’t address your question, can you give an example of an Ensembl ID that was omitted that you think shouldn’t have been?
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego
--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/87137502-2bcf-468c-bc0d-1df63a94a3ean%40googlegroups.com.
Hello,
27 samples is well within the range of what we’d consider an acceptably large dataset (assuming they’re divided somewhat equally into two phenotypes), so for this signal2noise would be the recommended metric.
There are a couple of issues here, though. 2000-2500 genes in the input dataset is well below what we’d recommend. GSEA is designed to be run on all the expressed genes, and a list that small usually indicates some sort of significance filtering, which isn’t a part of the recommended GSEA data processing pipeline. We also don’t generally recommend using log2 transformed data; the input you’d want for GSEA is some sort of between-sample normalized counts (various methods exist to do this normalization, such as options in DESeq2 which are enabled by default in the GenePattern.org and usegalaxy implementations of DESeq2).
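As a rough illustration of what “between-sample normalized counts” means, below is a simplified Python sketch of median-of-ratios size factors, the idea behind DESeq2's default normalization. It is not DESeq2 itself, the function names and the toy counts are invented for the example, and for real data you would run DESeq2 rather than this.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (simplified sketch).

    counts: dict mapping gene -> list of raw counts, one entry per sample
    """
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with any zero count are skipped.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}


# Toy data: the second sample was sequenced exactly twice as deeply.
raw = {"g1": [10, 20], "g2": [100, 200], "g3": [5, 10]}
normalized = normalize(raw)
print(size_factors(raw))
print(normalized)
```

After normalization the two samples in the toy data carry the same value for every gene, which is the point: differences in depth are removed while the values stay on a linear (non-log) scale.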
When you say “I prepared the genesets from BioMart for zebrafish” do you mean you prepared gene sets for the gene sets database input, or you produced orthology conversion chips to be used with MSigDB gene sets database files?
The “omit” parameter is only applicable when collapsing the dataset with a chip file. If you’ve done orthology conversion, I recommend leaving this as-is (false).
I should also note that we don’t officially offer support for zebrafish so I don’t know how much I’m going to be able to help with specific details.
-Anthony
I see. I’m not super familiar with proteomics pipelines or the normalization scheme you’ve described, but my suggestion would be to use the non-log transformed normalized data with signal-to-noise ratio as the ranking metric. If you only have the log transformed value for each sample, it should be fairly trivial to unlog the sample-by-gene matrix without affecting the global normalization that was done.
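A minimal Python sketch of both steps follows, under some assumptions: the values are plain log2 (no pseudocount was added before logging), the function names and toy numbers are invented, and the signal-to-noise here is just the textbook form, difference of class means over the sum of class standard deviations. GSEA computes the metric for you and may apply its own adjustments to the standard deviations, so this is only meant to show the idea.

```python
from statistics import mean, stdev


def unlog2(matrix):
    """Undo a plain log2 transform on a gene -> per-sample-values mapping."""
    return {gene: [2.0 ** v for v in row] for gene, row in matrix.items()}


def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B); fails if both sds are zero."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(matrix, labels, class_a, class_b):
    """Score each gene and sort from most A-associated to most B-associated."""
    scores = {}
    for gene, row in matrix.items():
        a = [v for v, lab in zip(row, labels) if lab == class_a]
        b = [v for v, lab in zip(row, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy log2 matrix: three "treated" samples followed by three "control" samples.
labels = ["treated"] * 3 + ["control"] * 3
log_matrix = {
    "geneX": [3.0, 3.2, 3.1, 1.0, 1.2, 0.9],
    "geneY": [2.0, 2.1, 1.9, 2.6, 2.4, 2.5],
}
linear = unlog2(log_matrix)
ranking = rank_genes(linear, labels, "treated", "control")
print(ranking)
```

If a pseudocount was added before the log transform (for example log2(x + 1)), it would need to be subtracted again after unlogging.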
One other suggestion is that, since you have so few genes, you might lower the min size parameter under basic fields to “10” or “5”. This will change the threshold that GSEA uses to reject gene sets that don’t have enough members present in the input assay. Normally we don’t recommend changing this parameter too much, and I’d be cautious about sets that small, but with only 2000ish genes, you might have to.
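To show what that threshold does, here is a small Python sketch of the size filter: a gene set is kept only if enough of its members are actually present in the dataset. The function name `filter_gene_sets` and the toy gene and set names are made up; the default of 15 mirrors GSEA's default min size.

```python
def filter_gene_sets(gene_sets, dataset_genes, min_size=15):
    """Keep gene sets with at least min_size members present in the dataset."""
    present = set(dataset_genes)
    kept = {}
    for name, members in gene_sets.items():
        overlap = set(members) & present  # only members measured in the assay count
        if len(overlap) >= min_size:
            kept[name] = sorted(overlap)
    return kept


# Toy data: 20 measured genes; SET_A has 12 of them, SET_B only 4.
dataset_genes = [f"g{i}" for i in range(20)]
gene_sets = {
    "SET_A": [f"g{i}" for i in range(12)] + ["missing1", "missing2"],
    "SET_B": ["g0", "g1", "g2", "g3", "missing3"],
}

print(list(filter_gene_sets(gene_sets, dataset_genes)))               # nothing passes at 15
print(list(filter_gene_sets(gene_sets, dataset_genes, min_size=10)))  # SET_A passes
```

Note that the size is counted after restricting each set to genes in the dataset, which is why a small input like yours loses so many sets at the default.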
-Anthony