creating a gene set for non-model organism

Atshaya

Feb 12, 2024, 8:04:54 PM

to gsea-help

Hi there,

I have my proteomics dataset ready for statistical analysis, and I'm currently exploring different software options to see which works best for my dataset. During this process, I came across gene-set enrichment analysis software. My species of interest is Atlantic salmon, which is considered a non-model organism compared to humans and mice.

I'm aware that GSEA offers a wealth of gene sets for humans and mice, but unfortunately, I couldn't find any specific to salmon on the different platforms I searched.

I have a few questions:

1) Since my data is from proteomics and uses protein accession numbers, I believe the software may not recognize them directly. So, I attempted to map my 6000 protein IDs to Ensembl IDs using Biomart. However, I could only map 2500 of them. I'm uncertain if this approach is correct, especially considering the loss of potentially meaningful information.

2) My main challenge lies in preparing gene sets for my data. As far as I understand, creating gene sets involves gathering enough information about genes contributing to specific biological pathways in the organism of interest. For example, if I'm comparing the oxidative phosphorylation pathway between two sample time points, I would need to compile a gene set related to this process based on previous literature evidence. Could you confirm if my understanding is accurate?

I even tried to use gene sets from closely related species such as zebrafish, but unfortunately I couldn't retrieve them from any platform. Do you have any recommendations for platforms where I could search for gene sets?

Given the limited number of Ensembl IDs in my expression dataset and the difficulty in preparing gene sets for my research model, what would you suggest as the best approach for my situation?

I'm looking forward to your response

Thanks in advance

Anthony Castanza

Feb 15, 2024, 2:31:46 PM

to gsea...@googlegroups.com

Hi Atshaya,

The loss of genes from the translation to Ensembl IDs is definitely an issue. 6000 genes is already on the low side for an approach like GSEA, and with 2500 it is going to be very difficult to get reliable results. Your best bet is an ortholog-based analysis, but to do this you're going to want to preserve as many genes as possible. Based on the gene set files I'm able to locate or provide, the best namespace for your analysis is going to be Danio rerio NCBI Gene IDs, not Ensembl IDs, unfortunately.

You might have to reduce the min size threshold parameter for GSEA, but in your results you'd need to be very careful to look only at sets that have a reasonable proportion of their members included in the data.
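
As an illustration of that check, here is a minimal Python sketch that reports, for each gene set in a GMT file, how many of its members are present in the dataset. It assumes the usual GMT layout (tab-separated set name, description, then member IDs, one set per line). The function names and the toy IDs are hypothetical and are not taken from any of the files mentioned in this thread.

```python
def parse_gmt(text):
    """Parse GMT text: each line is name <tab> description <tab> member IDs."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *members = fields
        members = {m for m in members if m}
        if members:
            gene_sets[name] = members
    return gene_sets


def set_coverage(gene_sets, dataset_ids):
    """Return {set name: (members found in data, set size, fraction found)}."""
    dataset_ids = set(dataset_ids)
    report = {}
    for name, members in gene_sets.items():
        found = len(members & dataset_ids)
        report[name] = (found, len(members), found / len(members))
    return report


# Toy GMT content with made-up IDs, for illustration only.
toy_gmt = "SET_A\tna\t1\t2\t3\t4\nSET_B\tna\t5\t6\n"
coverage = set_coverage(parse_gmt(toy_gmt), ["1", "2", "5", "9"])
for name, (found, size, fraction) in coverage.items():
    print(f"{name}: {found}/{size} members in data ({fraction:.0%})")
```

What counts as a "reasonable proportion" is a judgment call; the sketch only makes the numbers visible so that they can be reported alongside any significant result.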

Yes, that is roughly the process for compiling a gene set. Typically for canonical pathways like that you can rely on data from pathway databases, although since you have a non-model organism you would likely need to use a model organism pathway database with some form of ortholog conversion.

WikiPathways does have a current, GSEA compatible, pathways file for Zebrafish: https://data.wikipathways.org/current/gmt/wikipathways-20240210-gmt-Danio_rerio.gmt

It uses NCBI gene IDs, so if you are able to map your proteins to current NCBI gene IDs and then convert them to the corresponding Zebrafish gene IDs, you might get reasonable results with this file.
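
The two-step conversion described here (protein accession to salmon NCBI gene ID, then salmon gene ID to zebrafish gene ID) could be sketched in Python roughly as follows. This is only a sketch: the two lookup tables are assumed to have been exported beforehand from whatever mapping and ortholog resources you use, and every ID shown is a made-up placeholder. Tracking where each ID is lost makes it easier to see which step costs the most genes.

```python
def chain_map(protein_ids, protein_to_salmon_gene, salmon_to_zebrafish_gene):
    """Map protein IDs to zebrafish gene IDs through salmon gene IDs.

    Returns (mapped, lost_at_step1, lost_at_step2), where `mapped` is
    {protein ID: zebrafish gene ID} and the two lists hold the protein IDs
    that dropped out at each step.
    """
    mapped = {}
    lost_at_step1, lost_at_step2 = [], []
    for protein_id in protein_ids:
        salmon_gene = protein_to_salmon_gene.get(protein_id)
        if salmon_gene is None:
            lost_at_step1.append(protein_id)  # no salmon gene ID found
            continue
        zebrafish_gene = salmon_to_zebrafish_gene.get(salmon_gene)
        if zebrafish_gene is None:
            lost_at_step2.append(protein_id)  # no zebrafish ortholog found
            continue
        mapped[protein_id] = zebrafish_gene
    return mapped, lost_at_step1, lost_at_step2


# Hypothetical lookup tables; real ones would be loaded from files.
protein_to_salmon = {"PROT_A": "S1", "PROT_B": "S2", "PROT_C": "S3"}
salmon_to_zebrafish = {"S1": "Z1", "S2": "Z2"}

mapped, lost1, lost2 = chain_map(
    ["PROT_A", "PROT_B", "PROT_C", "PROT_D"], protein_to_salmon, salmon_to_zebrafish
)
print(f"mapped {len(mapped)} of 4; lost at step 1: {lost1}; lost at step 2: {lost2}")
```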

I was also able to generate the attached Gene Ontology gene set files for Zebrafish

They're based on NCBI Gene IDs, like with the WikiPathways data, and Gene to GO assignment data from 2024-02-14, with the GO annotations from 2024-01-17 (the latest data sources for both as of the time of this email). The files themselves are also in the namespace of Danio rerio NCBI gene IDs. This might be a slightly different version of the underlying NCBI gene ID data than the WikiPathways data, but it should be relatively close. These aren't quite up to the standards for what we'd include in MSigDB (lacking the similarity filtering that we'd normally apply) but they should be serviceable enough provided you can get a sufficient ortholog conversion for your dataset.

Sorry I couldn't be of more help here, and about the delay in getting back to you. Let me know if you have any more questions though

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/6e65194f-d2c8-4e27-92eb-044ab660cd95n%40googlegroups.com.

GO_CC_Danio_rerio_ezid.gmt

GO_MF_Danio_rerio_ezid.gmt

GO_BP_Danio_rerio_ezid.gmt

Nebula Nebula

Feb 19, 2024, 1:21:52 AM

to gsea...@googlegroups.com

Hi Anthony

Good day!

Thank you for your detailed response and for recommending ortholog conversion. I greatly appreciate your assistance in providing the gene sets for zebrafish.

Following your suggestion, I successfully mapped my proteins to current NCBI gene IDs using DAVID bioinformatics and then converted them to the corresponding zebrafish gene IDs using NCBI. However, during the conversion process, the total number of genes was reduced to 1851 out of the 7796 genes initially considered. Do you believe this reduction is acceptable for my expression dataset?

Furthermore, since my data isn't transcriptomics-based, I plan to use these 1851 gene IDs in place of probe set IDs. Additionally, I intend to include two columns for gene symbol and gene title to create a chip file format. Could you please confirm if this approach is correct?

Lastly, could you advise whether any adjustments are needed to the default settings in the basic and advanced fields for my dataset to run successfully? (Screenshot attached)

Looking forward to hearing from you

Have a great day

Cheers

Atshaya

Screenshot 2024-02-19 170923.png

Screenshot 2024-02-19 171447.png

Anthony Castanza

Feb 20, 2024, 1:13:48 PM

to gsea...@googlegroups.com

Hi Atshaya,

A gene list reduction of that magnitude is definitely not ideal. Have you tried any other tools for creating the ortholog mapping? If you send me a sampling of the original protein IDs you're using I can attempt to create a mapping file in the correct namespace using Ensembl Biomart Data directly that you can supply as a chip file for GSEA.

That said, I don't know that another approach is going to do a whole lot better. These kinds of conversions from non-model organisms are never ideal, so you're going to have to take any results you get here from GSEA with a pretty big grain of salt. In any publications, be very careful to disclose the limitations of the mapping and the specifics of the final, post-mapping gene set membership for any significant results.

For your data where you've already converted to orthologous Zebrafish genes in NCBI gene IDs you shouldn't need to supply any additional CHIP files.

I don't know how your data specifically is formatted, but typically proteomics data is supplied as some sort of abundance metric to the GSEA Preranked mode as a ".rnk" file.
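
For reference, a ".rnk" file is a plain tab-delimited text file with one feature ID and one numeric score per line. The Python sketch below formats such a file from a dictionary of scores. The gene IDs and values are invented for illustration, and which abundance-derived metric to rank on is left open here, as it is above.

```python
def to_rnk(scores):
    """Format {gene ID: ranking metric} as .rnk text, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene_id}\t{score}\n" for gene_id, score in ordered)


# Made-up zebrafish gene IDs and scores, for illustration only.
toy_scores = {"1001": -0.8, "1002": 2.5, "1003": 0.3}
rnk_text = to_rnk(toy_scores)
print(rnk_text, end="")

# To save it for GSEA Preranked, something like:
# with open("my_ranked_list.rnk", "w") as handle:
#     handle.write(rnk_text)
```

Each gene ID should appear only once in the ranked list, so any duplicates produced by the ortholog mapping would need to be resolved before this step.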

For an input dataset this small, you'd probably want to decrease the min size parameter to 10 genes to help with getting results.
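
To see in advance what a given min size would leave you with, a rough Python sketch like the one below can help. It rests on my understanding that GSEA applies the size thresholds after dropping set members that are absent from the dataset; the function name, the max size of 500, and the toy sets are assumptions made for this example.

```python
def sets_passing_size_filter(gene_sets, dataset_ids, min_size=10, max_size=500):
    """Keep gene sets whose overlap with the dataset is within the size limits.

    gene_sets is {set name: iterable of member IDs}. The returned dict maps
    each surviving set name to the members that are present in the dataset.
    """
    dataset_ids = set(dataset_ids)
    kept = {}
    for name, members in gene_sets.items():
        overlap = set(members) & dataset_ids
        if min_size <= len(overlap) <= max_size:
            kept[name] = overlap
    return kept


# Toy data: 12 genes measured; one set overlaps by 12 members, one by 5.
dataset = [str(i) for i in range(1, 13)]
toy_sets = {
    "SET_BIG": [str(i) for i in range(1, 21)],
    "SET_SMALL": [str(i) for i in range(1, 6)] + [str(i) for i in range(100, 111)],
}

kept_at_10 = sets_passing_size_filter(toy_sets, dataset, min_size=10)
kept_at_15 = sets_passing_size_filter(toy_sets, dataset, min_size=15)
print(sorted(kept_at_10), sorted(kept_at_15))
```

Comparing the two results shows how many sets a lower threshold rescues, which is worth knowing before relaxing it.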

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Message has been deleted

Anthony Castanza

Feb 21, 2024, 1:28:13 PM

to gsea...@googlegroups.com

Hi Atshaya,

The case of not needing a chip file was for where you'd already done the ortholog conversion.

If you want to try supplying the UniProt IDs directly to GSEA, I prepared the attached chip file. It should work to translate your UniProt IDs to the Zebrafish NCBI IDs that were used in the gene set database files I previously sent you.

To use it, you would need to load and select this file, then enable the Collapse/Remap parameter.
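
For anyone who wants to build a similar mapping file themselves, here is a minimal Python sketch of the layout as I understand the CHIP format: a tab-delimited table with the header columns "Probe Set ID", "Gene Symbol" and "Gene Title", where the second column holds the identifier in the namespace of the gene sets. The rows below are made-up placeholders and do not reproduce the contents of the attached file.

```python
def to_chip(rows):
    """Format (source ID, target ID, title) rows as tab-delimited CHIP text."""
    lines = ["Probe Set ID\tGene Symbol\tGene Title"]
    for source_id, target_id, title in rows:
        # Use "NA" when no description is available for the gene.
        lines.append(f"{source_id}\t{target_id}\t{title or 'NA'}")
    return "\n".join(lines) + "\n"


# Hypothetical protein accessions mapped to hypothetical zebrafish gene IDs.
toy_rows = [
    ("PROT_A", "1001", "example gene one"),
    ("PROT_B", "1002", ""),
]
chip_text = to_chip(toy_rows)
print(chip_text, end="")
```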

One difficulty I see with your dataset is that you have cases where several UniProt IDs are concatenated with semicolons. GSEA can't handle this as-is, but I'm not entirely sure of the best way to split these apart. One suggestion would be to use a script to split the IDs on the semicolon and then divide the abundance equally among the resulting IDs. If you do that, then the best collapsing mode to use might be "sum_of_probes" (advanced fields parameter "Collapsing mode for probe sets => 1"). I do need to reiterate here, though, that we're way beyond the bounds of the "supported" pipelines for running GSEA.

From the screenshot you sent me it looks like you might have multiple abundances? If you have multiple samples and are trying to do some sort of differential expression for differential enrichment, then the "preranked" method I suggested previously probably wouldn't be what you want to use. That case would be better suited to standard GSEA.
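
A minimal Python sketch of that splitting step is shown below, assuming each row is a pair of an ID field (possibly several IDs joined by semicolons) and a single abundance value. The second function only illustrates what a sum-based collapse to gene level would then produce; it is not GSEA's own implementation, and all IDs and numbers are invented.

```python
from collections import defaultdict


def split_shared_abundance(rows):
    """Split 'A;B;C' ID fields and share the abundance equally among the IDs."""
    split_rows = []
    for id_field, abundance in rows:
        ids = [part.strip() for part in id_field.split(";") if part.strip()]
        if not ids:
            continue  # nothing usable in this row
        share = abundance / len(ids)
        split_rows.extend((protein_id, share) for protein_id in ids)
    return split_rows


def sum_by_gene(split_rows, protein_to_gene):
    """Sum the abundances of all protein IDs that map to the same gene."""
    totals = defaultdict(float)
    for protein_id, abundance in split_rows:
        gene = protein_to_gene.get(protein_id)
        if gene is not None:
            totals[gene] += abundance
    return dict(totals)


# Made-up example: P1 and P2 share one measurement and map to the same gene.
toy_rows = [("P1;P2", 100.0), ("P3", 40.0)]
toy_mapping = {"P1": "G1", "P2": "G1", "P3": "G2"}

split_rows = split_shared_abundance(toy_rows)
gene_totals = sum_by_gene(split_rows, toy_mapping)
print(split_rows)
print(gene_totals)
```

When all IDs in a group map to the same gene, splitting and then summing gives back the original abundance, which is the reason a sum-based collapse pairs naturally with an equal split.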

Let me know if you have more questions, or encounter additional error messages, but any deeper debugging is likely to require access to your data files and would likely be handled best over the private gsea...@broadinstitute.org address rather than the public forums.

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

On Tue, Feb 20, 2024 at 8:42 PM Nebula Nebula <atshayas...@gmail.com> wrote:

Hi Anthony

Many thanks for your valuable suggestions and time. I have attached my original protein IDs (for Salmo salar) in txt format. I highly appreciate your effort in giving it a try.

From your email, it's clear that I don't need any chip files for my dataset because of the orthologous conversion.

My proteomics dataset is keyed by protein IDs. For preparing the rnk file, I believe I will be using the corresponding ortholog gene IDs in their place, and you are correct that I have a protein intensity value for each of the proteins identified.

Looking forward to your response.

Have a great day!

Cheers
Atshaya

Salmon_UniProt_to_Zebrafish_NCBI_ID.chip
