Questions regarding "genes found from clusters" statistics in ClueGO log files

gss2...@gmail.com


Oct 13, 2016, 6:00:40 PM

to cytoscape-helpdesk

Hello all,

I am currently using Cytoscape 3.4.0 with ClueGO 2.2.6 to perform clustering on a list of 972 genes derived from ChIP-Seq data. I selected several ontologies, and the log file (pasted below) reports that only 621 of these genes were found from my cluster using the reference sets of the selected ontologies. What does it mean that only 621 of my 972 genes are found? Does it mean that only 621 of my genes appear in the reference sets for these ontologies? If so, does the absence of the remaining genes from these ontologies mean that ClueGO cannot functionally cluster them, even if a more comprehensive reference set is used? If not, what does it mean? In addition, how common is it for a significant proportion of the analyzed genes not to be found, and how is this generally handled when reporting clustering results?

Thank you for your time,

Gabe Stephens

Log file:

### All Results were created with ClueGO v2.2.6 ###

Identifier types used: [AccessionID]

Evidence Codes used: [All]

#Genes in WikiPathways_10.02.2016 : 4788

#Genes in REACTOME_10.02.2016 : 7108

#Genes in GO_ImmuneSystemProcess-GOA_07.10.2016_10h10 : 2249

#Genes in KEGG_12.10.2016 : 7921

#Genes in GO_MolecularFunction-GOA_07.10.2016_10h10 : 21023

#Genes in GO_CellularComponent-GOA_07.10.2016_10h10 : 21547

#Genes in GO_BiologicalProcess-GOA_07.10.2016_10h10 : 21286

#All unique Genes: 22327

Total # of Genes from Cluster#1 972, with 0 (0.0%) missing!

#All Genes found from initial Cluster#1 (972.0): 621.0 (63.89%)

#All Genes found from 1 initial Cluster(s) (972.0): 621 (63.89%)

#Genes found from all Clusters after selection: 589 (60.6%)

KappaScore Grouping:

Iteration: 0 with 118 groups

Iteration: 1 with 120 groups

Iteration: 2 with 107 groups

Iteration: 3 with 76 groups

Iteration: 4 with 74 groups

Final KappaScore groups = 74

# Terms not grouped = 0

#GO All Terms Specific for Cluster #1: 256

Ontology used:

GO_BiologicalProcess-GOA_07.10.2016_10h10

GO_CellularComponent-GOA_07.10.2016_10h10

GO_ImmuneSystemProcess-GOA_07.10.2016_10h10

GO_MolecularFunction-GOA_07.10.2016_10h10

KEGG_12.10.2016

REACTOME_10.02.2016

WikiPathways_10.02.2016

Evidence codes used:

All

Identifiers used:

AccessionID

List of missing Genes:

Cluster #1

Statistical Test Used = Enrichment/Depletion (Two-sided hypergeometric test)

Correction Method Used = Benjamini-Hochberg

Min GO Level = 1

Max GO Level = 20

All GO Levels = false

Cluster #1

Sample File Name = File selection: [my list of accessions in a .txt file]

Number of Genes = 1

Get All Genes = false

Min Percentage = 1.0

Get All Percentage = false

GO Fusion = true

GO Group = true

Kappa Score Threshold = 0.5

Over View Term = SmallestPValue

Group By Kappa Statistics = true

Initial Group Size = 1

Sharing Group Percentage = 50.0

alex.pico


Oct 20, 2016, 12:38:25 PM

to cytoscape-helpdesk, gabriel...@crc.jussieu.fr

This looks like a good question for the authors of the ClueGO app. I've cc'ed one of them...
- Alex

Bernhard


Oct 20, 2016, 2:00:51 PM

to cytoscape-helpdesk, gabriel...@crc.jussieu.fr

Hi Gabe,

We checked the ClueGO results once again and found a bug in the log. Thanks for reporting this problem to us.
To get all annotations for your mutation list, I ran a ClueGO analysis using GO Biological Process (BP) and Molecular Function (MF). I think these sources cover most (if not all) annotations, including predictions (IEA evidence code). I mapped the identifiers across all GO levels, requiring at least 1 gene per term and no minimum percentage (some terms have fewer than 1% of their genes found). I included the genes without annotations in the network. See the attached example.
Your list of mutant-specific identifiers contained 1047 NM_ transcript ids, some of which were included more than once. The list contains 972 unique NM_ transcript identifiers, all of which were recognized by ClueGO (an EntrezGeneID was found for each of these ids).
The 972 transcript ids corresponded to 725 unique EntrezGeneIDs (the main id type used in ClueGO).
Out of the 725 genes, 615 were annotated in BP and/or MF, which represents 84.83% of the found genes. The remaining 110 genes (15.17%) have no annotation.
So the results obtained with ClueGO are correct; you can see the network and the corresponding table. The bug affected only the calculation of the number and percentage of found genes reported in the log. The original calculation reported the number and percentage of found genes relative to the unique uploaded identifiers (here 972 ids); it now uses the unique genes (here 725). In fact, only about 15% of the 725 genes from your list have no annotations in GO BP and MF. Among these are RIKEN cDNA ids, predicted genes and miRNAs. The roughly 36% of genes not found that the log implied (63.89% found) was not correct.
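To make the denominator issue concrete, here is a minimal Python sketch of the counting described above. It is not ClueGO code: the function names (`percent`, `coverage`), the toy transcript ids and the toy mapping are all hypothetical and only illustrate how several transcript ids can collapse to one gene, and how the reported percentage changes depending on whether it is taken over unique identifiers or unique genes. The last three lines simply redo the arithmetic with the counts quoted in this thread.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def coverage(transcript_ids, transcript_to_gene, annotated_genes):
    """Count unique ids, unique genes, and annotated genes for an input list."""
    unique_ids = set(transcript_ids)                 # drop ids listed more than once
    genes = {transcript_to_gene[t] for t in unique_ids
             if t in transcript_to_gene}             # several ids may share one gene
    found = genes & set(annotated_genes)             # genes with at least one annotation
    return len(unique_ids), len(genes), len(found)


# Toy data (hypothetical ids, not taken from the thread)
ids = ["NM_A1", "NM_A2", "NM_B1", "NM_B1", "NM_C1"]
mapping = {"NM_A1": "geneA", "NM_A2": "geneA",
           "NM_B1": "geneB", "NM_C1": "geneC"}
annotated = {"geneA", "geneB"}

n_ids, n_genes, n_found = coverage(ids, mapping, annotated)
print(n_ids, n_genes, n_found)      # 4 3 2
print(percent(n_found, n_ids))      # 50.0, per-identifier denominator
print(percent(n_found, n_genes))    # 66.67, per-gene denominator

# Counts quoted in this thread
print(percent(615, 725))            # 84.83, annotated genes over unique genes
print(percent(110, 725))            # 15.17, genes without annotation
print(percent(621, 972))            # 63.89, value printed in the original log
```

With the same number of found genes, dividing by unique identifiers instead of unique genes understates the coverage, which is the effect seen in the original log.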

Functional enrichment results rely on correctly annotated genes/proteins and on recent ontologies/pathways. For this reason, ClueGO provides automatic updates of its annotation and ontology sources, so users can analyze their genes/proteins in the context of the latest NCBI gene info together with up-to-date GO, KEGG, Reactome and WikiPathways data. In general, most genes are recognized. We also provide additional conversion files (e.g. EntrezGeneID to UniProt or Affymetrix), which are included in the organism archive or can be downloaded from within ClueGO.

Next, ontologies are more detailed in certain areas: well-known genes with established functions have many functional annotations in many sources, whereas other genes have scarce functional information. Ontologies are continuously improved; see, for example, the GO project.

Best

