Problem (controversy) in GENE2FUNC results

Nóra Eszlári

Jun 27, 2025, 6:23:41 PM

to FUMA GWAS users

Dear FUMA Team,

I'm afraid I've encountered some kind of error in my GENE2FUNC results.

In particular, for my job 634264, I noticed some inconsistencies in the GS.txt file and the enrichment figures.

In GS.txt, for the gene set "GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN", "N_genes" = 4 and "N_overlap" = 3, and indeed, in the figure I can see an overlap proportion of 0.75.

However, the number of genes in this gene set ("N_genes") is, in "reality", not 4 but 18, both in the downloadable MSigDB_20231Hs_MAGMA(1).txt (from here: https://fuma.ctglab.nl/tutorial#magma) and in the present version of MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN

And, the "real" number of overlapping genes between this gene set and my mapped (input) genes ("N_overlap") is not 3 but 17, both according to the "genes" column of the same GS.txt and the "overlapping genes" part of the same figure.

And I can see the very same problem (within the same job's results) with the gene set "GOCC_CHROMOSOME_CENTROMERIC_CORE_DOMAIN".

Here the numbers are: the number of genes in the gene set is "falsely" 5 but truly 19, and the number of overlapping genes is "falsely" 3 but truly 17.

Could you, please, help me with that problem? :)

Thank you a lot in advance!

Kind regards,

Nóra Eszlári.

Tanya Phung

Jun 30, 2025, 6:53:42 AM

to FUMA GWAS users

Hi Nora,

FUMA considers only the genes in a gene set that are also present in your set of background genes. In your example, this means that 4 genes are shared between the gene set "GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN" and the set of background genes, so "N_genes" reports the size of this overlap (and not the number of background genes).

See the relevant code:
https://github.com/vufuma/FUMA-webapp/blob/master/scripts/g2f/GeneSet.py#L37

https://github.com/vufuma/FUMA-webapp/blob/master/scripts/g2f/GeneSet.py#L21-L23
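
For readers who want to see how restricting to the background changes the reported counts, here is a minimal sketch in Python. It is not FUMA's actual code. The function names are hypothetical, the IDs are toy values, and it assumes that the input genes are restricted to the background in the same way as the gene set.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M, of which n are in the gene set."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / comb(M, N)


def enrichment(input_genes, gene_set, background):
    """Toy enrichment test in which every set is first cut down to the background."""
    bg = set(background)
    gs = set(gene_set) & bg        # size of this set plays the role of "N_genes"
    inp = set(input_genes) & bg    # assumption: input genes are filtered too
    overlap = gs & inp             # size of this set plays the role of "N_overlap"
    p = hypergeom_sf(len(overlap), len(bg), len(gs), len(inp))
    return {"N_genes": len(gs), "N_overlap": len(overlap), "p": p}


# Toy data: an 18-gene set of which only 4 IDs are present in the background.
background = set(range(1, 21))                   # 20 background genes
gene_set = {1, 2, 3, 4} | set(range(101, 115))   # 18 genes, 14 outside the background
input_genes = {1, 2, 3, 10} | set(range(101, 115))

result = enrichment(input_genes, gene_set, background)
print(result["N_genes"], result["N_overlap"])
```

With this toy data the gene set has 18 members but only 4 are counted, which mirrors the pattern described in this thread.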

Best,
Tanya

Tanya Phung

Jul 3, 2025, 4:03:40 PM

to FUMA GWAS users

Hi Nora,

I will follow up on your reply in this thread, so that it might benefit others who have the same questions.

In the example you give, the gene set GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN contains 18 genes:
GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN 1058 121504 3012 554313 8294 8335 8359 8360 8361 8362 8363 8364 8365 8366 8367 8368 8370 8970

As I mentioned in my earlier reply, GENE2FUNC takes the overlap between the genes in the gene set and the background genes. The file of background genes is currently not shared on the Download page on FUMA, but if you need it, please send an email and I can send it to you.

in R:
library(data.table)

bkgenes = fread("ENSG.genes.txt")

chromatin_gs = fread("GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN.txt")

> chromatin_gs
entrezID
<int>
1: 1058
2: 121504
3: 3012
4: 554313
5: 8294
6: 8335
7: 8359
8: 8360
9: 8361
10: 8362
11: 8363
12: 8364
13: 8365
14: 8366
15: 8367
16: 8368
17: 8370
18: 8970

bkgenes_entrezID = unique(bkgenes$entrezID) #since you specify all, no filtering was done

length(bkgenes_entrezID)
[1] 24304

chromatin_gs_entrezID = unique(chromatin_gs$entrezID)

length(chromatin_gs_entrezID)
[1] 18

common = intersect(bkgenes_entrezID, chromatin_gs_entrezID)

common
[1] 554313 1058 8335 8970

Thus, there are 4 genes that are common between the genes in the gene set and the background genes.
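
The same check can be sketched in Python, which also lists the IDs that drop out. This is only an illustration. `parse_gmt_line` is a hypothetical helper, and `background_ids` is a small stand-in holding the four common IDs reported above plus two placeholder IDs, not the real ENSG.genes.txt content.

```python
def parse_gmt_line(line):
    """Split one gene set line into (name, url, list of integer Entrez IDs)."""
    fields = line.split()
    return fields[0], fields[1], [int(x) for x in fields[2:]]


GMT_LINE = (
    "GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN "
    "http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/"
    "GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN "
    "1058 121504 3012 554313 8294 8335 8359 8360 8361 8362 "
    "8363 8364 8365 8366 8367 8368 8370 8970"
)

name, url, gene_set_ids = parse_gmt_line(GMT_LINE)

# Stand-in background: the four common IDs from this thread plus placeholders.
background_ids = {554313, 1058, 8335, 8970, 1, 2}

common = sorted(set(gene_set_ids) & background_ids)
missing = sorted(set(gene_set_ids) - background_ids)

print(len(gene_set_ids), "IDs in the gene set")
print(len(common), "found in the background:", common)
print(len(missing), "not found in the background:", missing)
```

Listing the missing IDs, rather than only counting the common ones, makes it easier to look each of them up afterwards.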

Best,
Tanya

Nóra Eszlári

Jul 4, 2025, 5:51:42 PM

to FUMA GWAS users

Dear Tanya,

Thank you a lot for your thorough help! :)

Now I can easily see that the FUMA GENE2FUNC hypergeometric test does not switch from Entrez ID to Ensembl ID, nor to gene symbol, either during the analysis or when defining the intersection between the background genes and the specific gene set.

But it's still very strange to me that the intersection is so narrow between the "all" type of background genes and this specific gene set. It's strange because within the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene/), each specific Entrez ID of this gene set returns, among other gene name aliases, the exact gene name by which the gene is displayed within this gene set in the "overlapping genes" part of the GENE2FUNC output figure!

Maybe this "ENSG.genes.txt" could be based on some prior curation, which may represent some kind of stability of the particular gene (position?, ENSG ID?) between the GRCh37 and GRCh38 assemblies, or something like that?? All of my mapped genes within this gene set are histone genes. :D

Sorry for asking so much! :D

Thank you a lot in advance,

Nóra.

Tanya Phung

Jul 7, 2025, 6:02:19 AM

to FUMA GWAS users

Hi Nora,

The ENSG.genes.txt file that is used by FUMA was generated by a colleague who no longer works on FUMA.
Here you can find the documentation on how he created this file: https://github.com/vufuma/FUMA-data/blob/main/ENSG/updateENSGv92.txt

Here you can find the file ENSG.genes.txt: https://github.com/vufuma/FUMA-data/blob/main/ENSG/ENSG.genes.txt

Best,
Tanya

Nóra Eszlári

Jul 11, 2025, 8:48:51 PM

to FUMA GWAS users

Dear Tanya,

Thank you a lot for your thorough help! :)

I compared the 18 Entrez IDs of this gene set (GOBP_PROTEIN_LOCALIZATION_TO_CENP_A_CONTAINING_CHROMATIN), taken from the GO_bp.gmt file used for the GENE2FUNC hypergeometric test, with both:

the ENSG.genes.txt file used as the background genes; and
my SNP2GENE result genes.txt file (the mapped genes of my GWAS, used as input genes for GENE2FUNC).

And I can see similar (although not identical) discrepancies with my SNP2GENE mapped genes as well: the same Entrez ID has two matching gene symbols, 14 of the 18 Entrez IDs are not present at all (whereas this number should be 1 if based on gene symbol: the sole gene of the gene set not mapped in my GWAS), etc.

So I think that the discrepancy may also be present between SNP2GENE results and the .gmt file used to define gene sets for GENE2FUNC.
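
A comparison of this kind can be sketched in Python as follows. The table contents, the column names (`ensg`, `symbol`, `entrezID`) and all gene names are hypothetical toy values. The missing Entrez IDs in the toy table are only one possible mechanism for such a discrepancy, and nothing in this thread confirms it as the cause.

```python
import csv
import io

# Toy stand-in for a mapped-genes table; "NA" marks a missing Entrez ID.
MAPPED = (
    "ensg\tsymbol\tentrezID\n"
    "ENSG_A\tGENE_A\t101\n"
    "ENSG_B\tGENE_B\tNA\n"
    "ENSG_C\tGENE_C\tNA\n"
)

# Toy gene set, given as symbol -> Entrez ID.
gene_set = {"GENE_A": 101, "GENE_B": 102, "GENE_C": 103, "GENE_D": 104}

rows = list(csv.DictReader(io.StringIO(MAPPED), delimiter="\t"))
mapped_symbols = {r["symbol"] for r in rows}
mapped_entrez = {int(r["entrezID"]) for r in rows if r["entrezID"].isdigit()}

# Match the gene set against the table once by symbol and once by Entrez ID.
by_symbol = sorted(s for s in gene_set if s in mapped_symbols)
by_entrez = sorted(s for s, e in gene_set.items() if e in mapped_entrez)

# Genes that a symbol-based match finds but an Entrez-based match loses.
lost = sorted(set(by_symbol) - set(by_entrez))

print("matched by symbol:", by_symbol)
print("matched by Entrez ID:", by_entrez)
print("lost when matching by Entrez ID:", lost)
```

Running the same two matches against the real genes.txt and the .gmt file would show which genes are affected.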