gmt file format Q

Vincent Lynch

Jan 23, 2023, 7:07:31 PM
to webgestalt
Hi All,

I've noticed that I get different enrichment results when my custom gmt file contains different nested gene sets. Specifically, the Ratio and P-values for the same gene set change depending on how many gene sets are included in the gmt file. This is especially noticeable if my gmt file includes the background gene set. Any idea what is going on here?

Thanks,
Vinny

Yuxing Liao

Jan 26, 2023, 5:59:48 PM
to Vincent Lynch, webgestalt
Hi Vincent,

Could you check in the summary section how many effective genes are used for the analysis? Both the input genes and the reference are filtered to those with a gene set annotation, and the input is also intersected with the reference. If your GMT is small, this filtering could have a bigger impact.
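To make the filtering concrete, here is a minimal sketch in Python. It is not WebGestalt's code: the function names, the toy gene symbols, and the order of operations are assumptions based only on the description above, and the ratio is assumed to be observed overlap divided by expected overlap.

```python
def parse_gmt(text):
    """Parse GMT text: one set per line, tab-separated name, description, genes."""
    gene_sets = {}
    for line in text.strip().splitlines():
        name, _description, *genes = line.split("\t")
        gene_sets[name] = set(genes)
    return gene_sets


def effective_genes(gene_sets, input_genes, reference_genes):
    """Filter input and reference as described above (assumed order)."""
    annotated = set().union(*gene_sets.values())
    # The reference keeps only genes annotated to at least one set in the GMT.
    eff_reference = set(reference_genes) & annotated
    # The input keeps only annotated genes that are also in the reference.
    eff_input = set(input_genes) & eff_reference
    return eff_input, eff_reference


def enrichment_ratio(gene_set, eff_input, eff_reference):
    """Observed overlap divided by expected overlap (assumed definition)."""
    in_reference = gene_set & eff_reference
    observed = len(in_reference & eff_input)
    expected = len(eff_input) * len(in_reference) / len(eff_reference)
    return observed / expected


# Toy data, purely hypothetical.
REFERENCE = {f"G{i}" for i in range(1, 11)}  # G1 .. G10
INPUT = {"G1", "G2", "G4", "G9"}
SMALL_GMT = "SET_A\tna\tG1\tG2\tG3\n"
LARGE_GMT = SMALL_GMT + "SET_B\tna\tG4\tG5\tG6\tG7\n"
# Adding the whole background as a set makes every reference gene "annotated".
WITH_BACKGROUND_GMT = LARGE_GMT + "BACKGROUND\tna\t" + "\t".join(sorted(REFERENCE)) + "\n"

for label, gmt in [("small", SMALL_GMT), ("large", LARGE_GMT),
                   ("with background", WITH_BACKGROUND_GMT)]:
    sets = parse_gmt(gmt)
    eff_input, eff_reference = effective_genes(sets, INPUT, REFERENCE)
    ratio = enrichment_ratio(sets["SET_A"], eff_input, eff_reference)
    print(label, len(eff_input), len(eff_reference), round(ratio, 3))
```

In this toy case SET_A itself never changes, yet the effective input grows from 2 to 3 to 4 genes and the effective reference from 3 to 7 to 10 as sets are added, so the ratio for SET_A changes as well.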

Yuxing


Vincent Lynch

Jan 28, 2023, 8:18:17 PM
to webgestalt
Here is the info. What I still don't understand is why the P-value and Ratio change depending on the number of gene sets in the gmt file. Shouldn't these be independent of each other?

  • Enrichment method: ORA
  • Organism: hsapiens
  • Enrichment Categories: uploads/test_1674832044.gmt ID Type: genesymbol
  • Interesting list: textAreaUpload_1674832044.txt. ID type: genesymbol
  • The interesting list contains 573 user IDs in which 562 user IDs are unambiguously mapped to 562 unique entrezgene IDs and 11 user IDs can not be mapped to any entrezgene ID.
  • The GO Slim summary are based upon the 562 unique entrezgene IDs.
  • Among 562 unique entrezgene IDs, 153 IDs are annotated to the selected functional categories and also in the reference list, which are used for the enrichment analysis.
  • Reference list: uploads/GeneSet HUGO_1674832044.txt ID type: genesymbol
  • The reference list can be mapped to 13100 entrezgene IDs and 13100 IDs are annotated to the selected functional categories that are used as the reference for the enrichment analysis.

Parameters for the enrichment analysis:

  • Minimum number of IDs in the category: 5
  • Maximum number of IDs in the category: 2000
  • FDR Method: BH
  • Significance Level: Top 10
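
One place where the number of gene sets matters by construction is the FDR column. With the BH (Benjamini-Hochberg) method, the adjusted value for a set depends on how many sets are tested. The sketch below is the standard BH procedure in plain Python, not WebGestalt's own code, and the p-values are made up. It affects the FDR only, not the raw P-value.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values: the same two sets, tested alone and then
# alongside two extra, non-significant sets.
two_sets = benjamini_hochberg([0.01, 0.04])
four_sets = benjamini_hochberg([0.01, 0.04, 0.5, 0.9])
print(two_sets[0], four_sets[0])
```

Here the first set keeps its raw p-value of 0.01, but its adjusted value doubles when the two extra sets are added to the test.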

Thanks
Vinny

Yuxing Liao

Jan 31, 2023, 3:13:12 PM
to Vincent Lynch, webgestalt
Limiting the scope to genes with the chosen functional annotation helps to obtain significant results. The size of the reference is fine in this case (13100). You have lost a bunch of input genes, from 562 to 153. If the ~400 lost genes were included, they would never be found in randomly sampled gene sets, so P-values would be worse in general.
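
As a rough illustration of that last point, here is a sketch of an over-representation p-value as a hypergeometric upper tail, using only the Python standard library. The formula is the textbook one and is assumed, not confirmed, to match what WebGestalt computes. The gene set size (100) and overlap (10) are hypothetical, and the reference is held at 13100 in both cases for simplicity.

```python
from math import comb


def ora_p_value(overlap, set_size, input_size, reference_size):
    """P(X >= overlap) when drawing input_size genes from the reference."""
    upper = min(set_size, input_size)
    total = comb(reference_size, input_size)
    tail = sum(
        comb(set_size, k) * comb(reference_size - set_size, input_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Same hypothetical gene set and overlap, two input sizes from the summary:
# 153 effective genes versus all 562 mapped genes.
p_filtered = ora_p_value(overlap=10, set_size=100, input_size=153, reference_size=13100)
p_unfiltered = ora_p_value(overlap=10, set_size=100, input_size=562, reference_size=13100)
print(p_filtered, p_unfiltered)
```

With the overlap fixed, the larger input list raises the expected overlap (input_size * set_size / reference_size), so the same 10 shared genes look less surprising and the p-value gets larger.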

Yuxing

Vincent Lynch

Jan 31, 2023, 3:18:44 PM
to Yuxing Liao, webgestalt
Thanks, that I get, but I still have a couple of questions: 1) Why do the P-values change when the number of gene sets in the gmt file changes? (Shouldn’t the P-values be independent of the number of gene sets?); and 2) If I include the background gene set in the gmt file, more genes are found in the intersection of the interesting genes and the sets in the gmt. Any idea why?

Thanks!