Functional Enrichment Analysis


Panagiotis

Jan 6, 2025, 6:35:27 AM
to webgestalt

Hi,

I recently performed a functional enrichment analysis (GO enrichment) with a gene list of approximately 460 up-regulated genes. Some of the returned GO terms seem overly broad and general, such as:

  • GO:0003674 (molecular_function)

    • Total Genes in Category: 17,339
    • Expected Genes in My List (Expect): 290.71
    • Enrichment Score: 1.4860
    • p-value: 3.3292e-56
    • Adjusted p-value: 7.2094e-53
  • GO:0008150 (biological_process)

    • Total Genes in Category: 16,777
    • Expected Genes in My List (Expect): 281.29
    • Enrichment Score: 1.5145
    • p-value: 5.9605e-56
    • Adjusted p-value: 1.1915e-52

These terms are quite general and include a large number of genes, so I expected their p-values to be closer to 1, as they likely don’t specifically reflect the genes in my list. However, I'm seeing extremely low p-values, suggesting strong enrichment despite their general nature.

I also realized that I did not select the non-redundant categories during the analysis, which may have led to these broad and overlapping terms. Could you clarify if this is expected behavior in WebGestalt when using broad GO terms, or could there be an issue with how the tool is treating these terms?

Additionally, what would be the best approach to focus on more specific, biologically relevant categories and avoid overrepresentation of overly general terms in the results?

Thank you!

Panagiotis

Jan 7, 2025, 12:51:08 PM
to webgestalt

To ensure the accuracy of the analysis, I also used another gene list containing 739 genes (see the attached .txt file) with the following settings in WebGestalt:

Redundancy Removal:
  • [✔] Weighted Set Cover (fast)
  • Affinity Propagation
  • k-Medoid

Category Parameters:

  • Minimum number of analytes for a category: 2
  • Maximum number of analytes for a category: 20,000

Statistical Adjustments:

  • Multiple Test Adjustment: Benjamini-Hochberg (BH)
  • Significance Level: FDR 0.05

Clustering and Visualization:

  • Number of categories expected from Set Cover: 100
  • Number of clusters (k) for k-Medoid: 100
  • Number of categories visualized in the report: 40

Method and Functional Databases of Interest

Method of Interest:

  • Over Representation Analysis (ORA)

Select Reference Set:

  • Uploaded reference genome (also attached)

Organism of Interest:

  • Homo sapiens

Functional Databases:

  • Gene Ontology: Cellular Component, Biological Process, Molecular Function
  • Pathway Databases: KEGG, Panther, Reactome, WikiPathways
  • Network Databases: Transcription Factor Target, miRNA Target
  • Disease Database: DisGeNET

While the analysis highlighted many of the expected terms, some of the identified terms were broad or redundant yet still showed high statistical significance, as illustrated in the examples below:

  Gene Set    | Description                               | Size  | Expect | Ratio  | P Value    | FDR
  ------------+-------------------------------------------+-------+--------+--------+------------+------------
  GO:0003674  | Molecular Function                        | 17339 | 231.34 | 1.0720 | 0.00032169 | 0.020706
  GO:0043231  | Intracellular Membrane-Bounded Organelle  | 11837 | 157.93 | 1.4437 | 8.0332e-22 | 5.1176e-19
  GO:0050794  | Regulation of Cellular Process            | 10698 | 142.73 | 1.3732 | 6.4466e-12 | 1.7601e-9

Is this a normal outcome? Shouldn't these broad terms be identified and managed accordingly?
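
For reference, the Expect and Ratio columns in the table above follow the standard ORA relationships (Expect = list size * category size / reference size; Ratio = observed overlap / Expect), so the implied reference size and observed overlap can be back-computed from any row. A minimal sketch in Python using the GO:0003674 row; the computed reference size is an inference from the reported values, not something stated in the thread:

    # Back-compute the implied reference size and observed overlap from a
    # WebGestalt ORA row, assuming the standard definitions:
    #   Expect = list_size * category_size / reference_size
    #   Ratio  = observed_overlap / Expect
    list_size = 739        # genes in the uploaded list
    category_size = 17339  # "Size" for GO:0003674
    expect = 231.34        # "Expect" column
    ratio = 1.0720         # "Ratio" column

    reference_size = list_size * category_size / expect  # implied reference size
    observed_overlap = ratio * expect                     # list genes hitting the category

    print(f"implied reference size ~ {reference_size:,.0f} genes")
    print(f"observed overlap       ~ {observed_overlap:,.0f} genes (vs. {expect:.1f} expected)")

Applying the same arithmetic to the other rows gives essentially the same implied reference size, which is a quick consistency check on the report.
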
gene_list.txt
Reference Genome .txt

John Elizarraras

Jan 16, 2025, 12:16:52 PM
to webgestalt
One set of parameters that is probably causing this is the

Category Parameters:

  • Minimum number of analytes for a category: 2
  • Maximum number of analytes for a category: 20,000

The maximum number of analytes for a category is set to 2,000 by default. I would recommend lowering this value to remove the very large categories you are seeing, which will make it easier to find more specific ones. You can explore different values, but I think 500 might be a good starting point if you are interested in more relevant and specific categories. You can also use affinity propagation for redundancy removal, which is more advanced than the weighted set cover method.
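
If re-running is not convenient, a rough post-hoc alternative is to filter the exported results table by category size (this only hides the large categories; it does not recompute the FDR). A minimal sketch in Python, assuming a tab-separated export with geneSet, description, size, and FDR columns; the file name and column names are assumptions to adjust to your actual export:

    import csv

    # Post-hoc filter of an exported WebGestalt ORA results table.
    # Assumptions: tab-separated file with "geneSet", "description",
    # "size", and "FDR" columns; adjust names to match your export.
    MAX_CATEGORY_SIZE = 500   # mirrors the suggested maximum category size
    FDR_CUTOFF = 0.05

    with open("enrichment_results.txt", newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))

    kept = [
        r for r in rows
        if int(r["size"]) <= MAX_CATEGORY_SIZE and float(r["FDR"]) <= FDR_CUTOFF
    ]

    print(f"kept {len(kept)} of {len(rows)} categories")
    for r in kept[:10]:
        print(r["geneSet"], r["description"], r["size"], r["FDR"])

Re-running with a lower maximum is still preferable, since the multiple-testing correction is then applied only to the categories that are actually tested.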

Since the categories you are seeing are large, the p-value is going to be very significant even for a fairly small enrichment ratio.
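
To make this concrete, here is a minimal hypergeometric sketch in Python, with numbers loosely based on the GO:0003674 example from the first post; the reference size and observed overlap are inferred from the reported expected count (290.71) and enrichment score (1.486), not taken from the report itself:

    from scipy.stats import hypergeom

    # Why a huge GO category with a modest enrichment ratio still gets a
    # tiny p-value. Reference size and observed overlap are inferred from
    # the reported expected count and enrichment score.
    reference_size = 27_435   # inferred size of the reference gene set
    category_size = 17_339    # GO:0003674 "Size"
    list_size = 460           # up-regulated gene list
    observed = 432            # ~ enrichment score * expected = 1.486 * 290.71

    expect = list_size * category_size / reference_size
    ratio = observed / expect

    # P(overlap >= observed) under the hypergeometric null used by ORA
    p_value = hypergeom.sf(observed - 1, reference_size, category_size, list_size)

    print(f"expected overlap: {expect:.1f}, ratio: {ratio:.2f}, p-value: {p_value:.2e}")

An enrichment ratio of about 1.5 over an expected overlap of about 290 still means roughly 140 more genes than expected, which the hypergeometric test treats as very strong evidence; capping the category size keeps these near-universal terms out of the test entirely.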

Let me know if you have any more questions.

Best,
John