ClueGO, Preselected Functions, columns "Genes Cluster #1"... bug?

Jan Söderman

Apr 27, 2020, 6:07:48 AM

to cytoscape-helpdesk

Hi,

I have used the Preselected Functions option of ClueGO to visualize relationships between a set of GO terms. However, I have some concerns regarding the genes listed in the columns "Genes Cluster #1" and "Genes Cluster #2" of the result table.

I provide an example below that seems incorrect. For the analysis I used a Kappa score of 0.2, which resulted in 47 groups (104 GO terms). In the analysis the GO term "G0 to G1 transition" (GO:0045023) is placed in its own group (22) and its child term "regulation of G0 to G1 transition" (GO:0070316) is placed in another group, also by itself. This seems incorrect since it is a parent-child relationship. For "G0 to G1 transition" ClueGO reports only two genes, whereas for "regulation of G0 to G1 transition" a large number of genes are reported (see below). Is this a bug or is there another explanation for this behaviour?

What is the origin of the genes of the columns "Genes Cluster #1" and "Genes Cluster #2"? If these genes are used for the grouping of GO terms then it is critical that the gene sets are correctly retrieved.

ClueGO
"G0 to G1 transition": [CDK3, CDKN3]
"regulation of G0 to G1 transition": [ABI2, ACTRT1, AGXT, APAF1, APC, ARL6IP1, BAP1, BIRC3, BMI1, BRCA1, CBX3, CBX5, CDC7, CHEK1, COMMD3-BMI1, DAB2IP, DUX4, DUX4L1, DUX4L10, DUX4L2, E2F1, E2F6, EED, EHMT1, EHMT2, EPC1, EZH2, FOXO4, GOLGA6A, HLA-G, L3MBTL2, LOC107987484, LOC107987485, LOC107987486, LOC107987487, MAGI1, MAGI2, MAX, MGA, MGAM, PCGF2, PCGF6, PDCD6IP, PHC1, PHC3, PTGDR, PTGDR2, RAD51, RBBP4, RBBP7, RBBP8, RCBTB1, REEP5, RHNO1, RING1, RNF2, RRM2, RYBP, SOX12, SOX15, SUZ12, TFDP1, TFDP2, UXT, WDR1, YAF2].

Originally the GSEA was conducted using clusterProfiler (R software), with clearly overlapping sets of genes contributing to the GO term enrichment.
"G0 to G1 transition": MDM4/RHNO1/APAF1/RAD51/BRCA1/RBBP8/EPC1/CHEK1/RYBP/RBBP7/EED/BMI1/EZH2/MGA/MED1/YAF2/TFDP2/SUZ12/RNF2/CBX5/PHC3/RBBP4
"regulation of G0 to G1 transition": APAF1/RAD51/BRCA1/RBBP8/EPC1/CHEK1/RYBP/RBBP7/EED/BMI1/EZH2/MGA/MED1/YAF2/TFDP2/SUZ12/RNF2/CBX5/PHC3/RBBP4
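The overlap between the two clusterProfiler gene sets can be checked with a few set operations. This is only a small sketch using the gene lists quoted above; the variable names are made up for illustration.

```python
# Genes contributing to the enrichment in clusterProfiler, copied from the lists above.
cp_parent = set(
    "MDM4/RHNO1/APAF1/RAD51/BRCA1/RBBP8/EPC1/CHEK1/RYBP/RBBP7/EED/BMI1/"
    "EZH2/MGA/MED1/YAF2/TFDP2/SUZ12/RNF2/CBX5/PHC3/RBBP4".split("/")
)
cp_child = set(
    "APAF1/RAD51/BRCA1/RBBP8/EPC1/CHEK1/RYBP/RBBP7/EED/BMI1/"
    "EZH2/MGA/MED1/YAF2/TFDP2/SUZ12/RNF2/CBX5/PHC3/RBBP4".split("/")
)
# Genes ClueGO reports for "G0 to G1 transition".
cluego_parent = {"CDK3", "CDKN3"}

shared = cp_parent & cp_child            # genes present in both clusterProfiler sets
parent_only = cp_parent - cp_child       # genes only in the parent term's set
cluego_vs_cp = cluego_parent & (cp_parent | cp_child)

print(len(shared))           # 20
print(sorted(parent_only))   # ['MDM4', 'RHNO1']
print(sorted(cluego_vs_cp))  # []
```

So the child set is fully contained in the parent set in clusterProfiler, while neither of the two ClueGO genes appears in either clusterProfiler set.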

Sincerely,
Jan

Scooter Morris

Apr 30, 2020, 11:06:23 AM

to cytoscape-helpdesk

Hi Jan,

I'm forwarding this off to the ClueGO authors.

-- scooter

Bernhard

Apr 30, 2020, 2:34:40 PM

to cytoscape-helpdesk

Hi Jan, since you use the Preselected Functions option, ClueGO does not know the initial gene list you obtained these terms with. So, depending on the threshold for the maximum allowed genes per term (which you can set in the CluePedia options), you will get either all genes from the term or none. The genes you mentioned above are therefore all the genes associated with your two terms. Grouping is also difficult, because either all or no genes are added to the terms, and it is hard to find grouping options that give a correct grouping. Either lower the kappa score threshold or modify the grouping options. The best approach, however, would be to use the GO relation view, which always shows the relations directly from GO. One could also think about an option to add the genes that were used to obtain the GO terms by enrichment (unfortunately we have no time for this at the moment), or you could try to do the enrichment directly with ClueGO. It is not exactly GSEA, but the result should not be too far from it. Hope this helps.

Best

Jan Söderman

May 2, 2020, 5:34:11 AM

to cytoscape-helpdesk

Hi,

Thank you Scooter for forwarding my questions. Thank you Bernhard for your reply.

However, I still struggle to understand the ClueGO result, so I would be grateful for any additional insights you can share.

I have now re-run the analysis. This time I ensured that the genes per term threshold was set above the maximum size for gene sets used for the GSEA (550 vs. 500). Just in case I also updated the ontologies in ClueGO.

The result is the same. ClueGO still reports only the same two genes [CDK3, CDKN3] for "G0 to G1 transition". As far as I can tell, this term contains 48 human genes (http://amigo.geneontology.org), so it should not be filtered out by ClueGO (also, ClueGO reports two genes, not zero). For its child term "regulation of G0 to G1 transition" ClueGO still reports 66 genes, whereas AmiGO2 reports 46 human genes for the term. I also tried setting a lower Kappa score, but in this case it does not matter, since only two (out of 48) genes are present in ClueGO, and these are not even part of the child term.
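To illustrate why a lower Kappa threshold cannot link the two terms, here is a rough sketch that computes a standard Cohen's kappa on binary term membership for the two ClueGO gene lists. This rests on assumptions: the thread does not spell out ClueGO's exact kappa formula, the function name `cohen_kappa` is made up, and the gene universe size of 20,000 is an arbitrary placeholder.

```python
def cohen_kappa(set_a, set_b, universe_size):
    """Cohen's kappa for two gene sets, treating membership as a yes/no rating."""
    a = len(set_a & set_b)         # genes in both terms
    b = len(set_a - set_b)         # genes only in term A
    c = len(set_b - set_a)         # genes only in term B
    d = universe_size - a - b - c  # genes in neither term
    observed = (a + d) / universe_size
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / universe_size**2
    return (observed - expected) / (1 - expected)

# Gene lists as reported by ClueGO in this thread.
parent = {"CDK3", "CDKN3"}
child = set(
    "ABI2 ACTRT1 AGXT APAF1 APC ARL6IP1 BAP1 BIRC3 BMI1 BRCA1 CBX3 CBX5 CDC7 "
    "CHEK1 COMMD3-BMI1 DAB2IP DUX4 DUX4L1 DUX4L10 DUX4L2 E2F1 E2F6 EED EHMT1 "
    "EHMT2 EPC1 EZH2 FOXO4 GOLGA6A HLA-G L3MBTL2 LOC107987484 LOC107987485 "
    "LOC107987486 LOC107987487 MAGI1 MAGI2 MAX MGA MGAM PCGF2 PCGF6 PDCD6IP "
    "PHC1 PHC3 PTGDR PTGDR2 RAD51 RBBP4 RBBP7 RBBP8 RCBTB1 REEP5 RHNO1 RING1 "
    "RNF2 RRM2 RYBP SOX12 SOX15 SUZ12 TFDP1 TFDP2 UXT WDR1 YAF2".split()
)

UNIVERSE_SIZE = 20_000  # assumed number of annotated genes, not taken from ClueGO

kappa = cohen_kappa(parent, child, UNIVERSE_SIZE)
print(len(child), len(parent & child), kappa <= 0)
```

With no shared genes the agreement is no better than chance, so under this formulation the score cannot exceed zero whatever universe size is assumed, and no positive threshold would connect the terms.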

Concerning the use of ClueGO, I did start out using it for enrichment analysis but found it quite difficult to identify an appropriate parameter setting as well as an appropriate number of differentially expressed genes to investigate.

In order to avoid specific parameter settings depending on the analyzed contrast I tried to find a common set of parameters that work for both "non-inflamed control tissue" vs. "non-inflamed disease tissue" (few DE genes) and "non-inflamed control tissue" vs. "inflamed disease tissue" (many DE genes), and that also allow for separate clusters for up-regulated genes (many) and down-regulated genes (fewer).

The settings resulted in either "Algorithm did not converge" or a result that seemed to lack known biological processes involved in the disease pathogenesis (especially for down-regulated genes). Also, compared with analyzing up- and down-regulated genes separately, an analysis based on two clusters drowned out the GO terms associated with down-regulated genes (i.e. those terms were missing from the result).

Because of this I tried GSEA using clusterProfiler instead. However, the ability to visualize the results and restructure the visualization is very limited in the clusterProfiler/enrichplot packages.

Sincerely,
Jan

Bernhard

May 12, 2020, 2:49:11 PM

to cytoscape-helpdesk

Dear Jan,

thanks for bringing up this issue.

We have verified all the source files and algorithms again. Here are some thoughts:

We automatically get the data from GOA site: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/25.H_sapiens.goa

In this file, each GO term has associated genes, usually one per line. Sometimes the gene ID is followed by several records that are supposed to be synonyms. To cover as much information as possible, we initially took all these synonyms into account (in theory they translate to the same EntrezGeneID), so as not to miss a gene because of conversion problems. We noticed that in some cases (as in the example you sent us) these genes are not all synonyms (e.g. CDK3 and CDKN3), and decided to keep just the first gene ID. For this reason, in your example the two genes that should be associated with "G0 to G1 transition" are MYC and CDK3.
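The difference between the two behaviours can be sketched roughly as follows. This is not ClueGO's actual parser: the column positions are those of the standard tab-separated GAF 2.x layout (symbol in column 3, GO ID in column 5, pipe-separated synonyms in column 11), and the helper names, gene symbols and term ID in the sample rows are invented placeholders, not real GOA records.

```python
from collections import defaultdict

# 0-based GAF 2.x column positions: object symbol, GO ID, pipe-separated synonyms.
SYMBOL, GO_ID, SYNONYMS = 2, 4, 10

def genes_per_term(lines, include_synonyms):
    """Collect gene names per GO term from GAF-style, tab-separated lines."""
    term_genes = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):  # skip blank and header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= SYNONYMS:                     # skip malformed rows
            continue
        names = [cols[SYMBOL]]
        if include_synonyms:
            names += [s for s in cols[SYNONYMS].split("|") if s]
        term_genes[cols[GO_ID]].update(names)
    return dict(term_genes)

def make_row(symbol, go_id, synonyms):
    """Build a 17-column GAF-like row (hypothetical helper for this example only)."""
    cols = [""] * 17
    cols[0] = "UniProtKB"
    cols[SYMBOL], cols[GO_ID], cols[SYNONYMS] = symbol, go_id, "|".join(synonyms)
    return "\t".join(cols)

# Invented rows: GENE_A carries a "synonym" GENE_B that is really a different gene.
rows = [
    "!gaf-version: 2.1",
    make_row("GENE_A", "GO:TERM1", ["GENE_A", "GENE_B"]),
    make_row("GENE_C", "GO:TERM1", []),
]

with_synonyms = genes_per_term(rows, include_synonyms=True)
first_only = genes_per_term(rows, include_synonyms=False)
print(sorted(with_synonyms["GO:TERM1"]))  # ['GENE_A', 'GENE_B', 'GENE_C']
print(sorted(first_only["GO:TERM1"]))     # ['GENE_A', 'GENE_C']
```

Keeping only the first identifier avoids pulling in a gene that was listed as a synonym but is not one, at the cost of possibly missing a gene when the primary identifier fails to convert.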

Another issue is the term-term relations in GO. You correctly pointed out that this is a parent-child term pair, but at the same time we have to consider the type of relation between these terms, and "regulation of..." is a regulator term. Initially, for this type of relation the genes of the child were not added to the parent, as recommended on the GOA site: "it would not be safe to include these genes". I now see that this message has been slightly modified, and it allows them to be included as associated with the parent term for enrichments.

"Unlike is a and part of, grouping annotations to gene products grouped via regulates changes the relationship between the GO term and the gene product over the is a and part of relations. If gene product X is annotated as involved in a process that regulates glycolysis, it would not be correct to conclude that X participates in glycolysis. Nevertheless, some tools use regulates relations to group annotations. This can be useful for gene-set enrichment. The resulting gene sets include genes that are involved in processes that are causally related to the grouping term."

GO relations: http://geneontology.org/docs/ontology-relations/

Considering all this, we have now decided to add the genes up to the parent by default when there is a "regulates" type of relation. In addition, we added the possibility to switch this off if the user prefers. This can be done in the .properties file for each organism. After this feature is enabled or disabled, the GO files have to be updated. To disable the option, set 'allow.relations.as.parents=false'.

It is available in the updated version (2.5.7). These are the relevant entries in the properties file:

#enable.add.all.child.genes.to.parent=false
# allow.relations.as.parents=false
#allow.empty.terms=true

If an option is commented out with #, it is automatically considered 'true'.

Also, 'enable.add.all.child.genes.to.parent=false' would allow only direct term associations and would not add child associations to the parent. By default this is 'true'.
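How the two switches interact can be sketched like this. It is only an illustration of the behaviour described above, not the ClueGO code: the function `propagate` and its parameter names are hypothetical (they merely mirror the property names), and the child term's direct annotations are a three-gene subset of the list from this thread, chosen for brevity.

```python
def propagate(direct, edges, add_child_genes=True, relations_as_parents=True):
    """Return term -> genes after optionally adding child genes to parent terms.

    direct: dict mapping term name to its directly annotated genes
    edges:  list of (child, relation, parent) tuples
    """
    allowed = {"is_a", "part_of"}
    if relations_as_parents:          # mirrors allow.relations.as.parents
        allowed.add("regulates")
    genes = {term: set(members) for term, members in direct.items()}
    if not add_child_genes:           # mirrors enable.add.all.child.genes.to.parent
        return genes
    changed = True
    while changed:                    # repeat until no parent gains a new gene
        changed = False
        for child, relation, parent in edges:
            if relation not in allowed:
                continue
            before = len(genes.setdefault(parent, set()))
            genes[parent] |= genes.get(child, set())
            changed = changed or len(genes[parent]) > before
    return genes

PARENT = "G0 to G1 transition"
CHILD = "regulation of G0 to G1 transition"
direct = {PARENT: {"MYC", "CDK3"}, CHILD: {"APAF1", "RAD51", "BRCA1"}}
edges = [(CHILD, "regulates", PARENT)]

new_default = propagate(direct, edges)
no_regulates = propagate(direct, edges, relations_as_parents=False)
direct_only = propagate(direct, edges, add_child_genes=False)

print(sorted(new_default[PARENT]))   # ['APAF1', 'BRCA1', 'CDK3', 'MYC', 'RAD51']
print(sorted(no_regulates[PARENT]))  # ['CDK3', 'MYC']
print(sorted(direct_only[PARENT]))   # ['CDK3', 'MYC']
```

With both switches left at 'true' the parent term picks up the genes of its "regulates" child, which is what allows the two terms to share genes and therefore be grouped together.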

Best
