uniqID and indels


Matthew Maher

May 4, 2024, 6:45:55 PM
to FUMA GWAS users
Is the intended use of FUMA generally restricted to simple SNPs, as opposed to Indels?

I've received from a colleague a results file of FUMA output, and the column 'uniqID' contains four colon-separated values, which I initially assumed were the usual CHR:POS:REF:ALT. But after some confusion due to not recognizing some variants, I saw this in the GitHub README:

uniqID : Unique ID of SNPs consists of chr:position:allele1:allele2 where alleles are alphabetically ordered.

The alphabetical ordering is fine for simple SNPs, but for indels it is often ambiguous and uninterpretable.
For example: does the uniqID "2:69908498:A:AT" (hg19) refer to the known variant 2:69908498:A:AT (an insertion of a T) or to the other known variant 2:69908498:AT:A (a deletion of a T)? I don't believe there's any way to tell. And the associated rsID is no help, as it's just as ambiguous.
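
To make the collision concrete, here is a minimal sketch in Python. The function `uniq_id` is hypothetical and only mimics the README rule as I read it (plain alphabetical sorting of the two alleles); it is not FUMA's actual code.

```python
def uniq_id(chrom, pos, allele_a, allele_b):
    """Build chr:pos:a1:a2 with the alleles sorted alphabetically.

    Assumption: plain string sorting stands in for the README's
    "alphabetically ordered" rule.
    """
    a1, a2 = sorted([allele_a, allele_b])
    return f"{chrom}:{pos}:{a1}:{a2}"


# The two known variants from the example, given as (REF, ALT).
insertion = uniq_id(2, 69908498, "A", "AT")  # REF=A,  ALT=AT
deletion = uniq_id(2, 69908498, "AT", "A")   # REF=AT, ALT=A

print(insertion)
print(deletion)
print(insertion == deletion)
```

Under that assumption both variants come out with the same ID, so the ID alone cannot say which one was meant.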

I see that FUMA's input screen never asks about REF/ALT alleles, so I suspect there really isn't much currently FUMA can do, except return the ambiguous IDs in the case of Indels. 

Or am I missing something? Or is the onus on whoever submitted the job (and thus knows, or should know, what variants were submitted) to remap the uniqID values to CHR:POS:REF:ALT?
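
For what it's worth, the remapping I have in mind would look roughly like the sketch below. It assumes the submitter still has the original (CHR, POS, REF, ALT) table; `uniq_id` and `build_lookup` are hypothetical helpers, the alphabetical rule is my reading of the README, and the SNP on chromosome 1 is made up for illustration.

```python
from collections import defaultdict


def uniq_id(chrom, pos, allele_a, allele_b):
    """Assumed README rule: chr:pos:a1:a2 with alleles sorted alphabetically."""
    a1, a2 = sorted([allele_a, allele_b])
    return f"{chrom}:{pos}:{a1}:{a2}"


def build_lookup(submitted):
    """Map each alphabetical uniqID back to the submitted CHR:POS:REF:ALT IDs."""
    lookup = defaultdict(list)
    for chrom, pos, ref, alt in submitted:
        lookup[uniq_id(chrom, pos, ref, alt)].append(f"{chrom}:{pos}:{ref}:{alt}")
    return lookup


submitted = [
    (2, 69908498, "A", "AT"),  # insertion of a T
    (2, 69908498, "AT", "A"),  # deletion of a T
    (1, 1000, "G", "C"),       # made-up SNP, for contrast
]
lookup = build_lookup(submitted)

for uid, candidates in lookup.items():
    status = "unique" if len(candidates) == 1 else "AMBIGUOUS"
    print(uid, candidates, status)
```

The remap works for the SNP, but if both indels were in the submitted data the uniqID maps back to two candidates, which is exactly the problem.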

Thanks for any information.

Matthew Maher

May 4, 2024, 7:26:30 PM
to FUMA GWAS users
Actually, I spoke a bit too quickly when I wrote: "there really isn't much currently FUMA can do, except return the ambiguous IDs in the case of Indels." There IS something FUMA could do:

FUMA knows which allele was Effect and which was Non-Effect. The submitter generally knows whether their Effect == ALT or their Effect == REF. So if FUMA simply constructed every uniqID as chr:pos:EA:NEA, or every uniqID as chr:pos:NEA:EA, possibly with a checkbox for the user to choose, then the submitter would always be able to interpret the result IDs unambiguously. It is the scrambling of the alleles into alphabetical order that I believe causes a loss of information and an ambiguous result.
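
A rough sketch of that proposal, assuming the chr:pos:EA:NEA convention. Both function names and the `effect_is_alt` flag are hypothetical, not anything FUMA currently offers.

```python
def ordered_id(chrom, pos, effect, non_effect):
    """Proposed ID: chr:pos:EA:NEA, with no alphabetical reordering."""
    return f"{chrom}:{pos}:{effect}:{non_effect}"


def to_ref_alt(ordered, effect_is_alt=True):
    """Recover CHR:POS:REF:ALT from chr:pos:EA:NEA.

    effect_is_alt is what the submitter knows about their own data:
    True if Effect == ALT, False if Effect == REF.
    """
    chrom, pos, ea, nea = ordered.split(":")
    ref, alt = (nea, ea) if effect_is_alt else (ea, nea)
    return f"{chrom}:{pos}:{ref}:{alt}"


# Submitter coded Effect == ALT for both indels at this position.
ins_id = ordered_id(2, 69908498, "AT", "A")  # effect allele AT (insertion)
del_id = ordered_id(2, 69908498, "A", "AT")  # effect allele A (deletion)

print(to_ref_alt(ins_id, effect_is_alt=True))
print(to_ref_alt(del_id, effect_is_alt=True))
```

With the order preserved, the two indels keep distinct IDs and each converts back to a single CHR:POS:REF:ALT.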

Marijn Schipper

May 21, 2024, 3:26:15 AM
to FUMA GWAS users
Dear Matthew,

Thank you for your questions. You make a fair point that, due to the reordering of the alleles, you cannot determine which was the effect allele and which was the non-effect allele.
In FUMA, what matters is the specific SNP (e.g. A vs T at a given genomic location), and there is no testing that needs the direction of effect. Doing it this way allows case-control GWAS to be coded either way with respect to effect allele vs non-effect allele.
When considering the SNP that you mention it is good to specify that for insertion deletions, given that they start at the same basepair, they cannot really be ambiguous. 
In the example that you give, both variants are exactly the same, the only difference is what you consider to be the reference. The links in the GnoMAD browser for example display the same rsID for both variants.
If you want to be sure that no scrambling takes place, mapping your variants to rsIDs and uploading to FUMA without basepair and chromosome coordinates will force FUMA to run on rsID, giving you more control over the mapping that is performed.
I hope that this answers your question. If you feel like there is a problem with my reasoning, please let me know, I'm happy to continue discussing.
Best,
Marijn

On Sunday, May 5, 2024 at 01:26:30 UTC+2, Matthew Maher wrote:

Matthew Maher

Jun 11, 2024, 2:52:30 PM
to FUMA GWAS users
Belatedly responding on this:  

Thanks for the feedback, Marijn, but I must disagree with this section that you wrote:

When considering the SNP that you mention it is good to specify that for insertion deletions, given that they start at the same basepair, they cannot really be ambiguous. 
In the example that you give, both variants are exactly the same, the only difference is what you consider to be the reference. The links in the GnoMAD browser for example display the same rsID for both variants.


I would first clarify that an rsID does not uniquely identify a (biallelic) variant. It really identifies a location, at which there could be any number of biallelic variants, each one being a different variation from the genome reference. In my example (the two links to gnomAD), examining the population frequencies should make clear that these are two very different variants. If you click through the RS# hyperlink and head to the dbSNP resource, you'll find it got merged into another RS#, but eventually you'll see that that RS# actually covers 14 (if I counted correctly) different variants!

I suspect the confusion stems from your referring to "what you consider to be the reference". The 'reference' allele (never to be confused with the 'effect' allele) is determined by the published genome reference; it's not something I decide. I think it is illustrative to click on the UCSC link on either of those gnomAD pages, where you can see the human genome reference sequence at the top, which shows, starting at position 69908498, an 'A' followed by 16 Ts (if I counted correctly). And thus:

the variant 2-69908498-AT-A results in a person having an A followed by 15 Ts (one fewer T)
the variant 2-69908498-A-AT results in a person having an A followed by 17 Ts (one more T)
which is why these are different variants, and why the allele order is critical when indels are involved.
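
The same arithmetic as a small Python sketch. The reference stretch (an A followed by 16 Ts) is my count from the UCSC view, so treat it as an assumption, and `apply_variant` is a hypothetical helper, not part of any library.

```python
def apply_variant(ref_seq, seq_start, pos, ref, alt):
    """Replace the REF allele at genomic position pos with ALT.

    ref_seq is a stretch of reference sequence whose first base sits
    at genomic position seq_start.
    """
    offset = pos - seq_start
    if ref_seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return ref_seq[:offset] + alt + ref_seq[offset + len(ref):]


start = 69908498
reference = "A" + "T" * 16  # assumed: 'A' followed by 16 Ts

deleted = apply_variant(reference, start, start, "AT", "A")   # 2-69908498-AT-A
inserted = apply_variant(reference, start, start, "A", "AT")  # 2-69908498-A-AT

print(deleted.count("T"), inserted.count("T"))
```

The two results differ in length by two bases, so swapping REF and ALT gives a different haplotype rather than a relabelling of the same one.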

So, circling back to my original question: is it possible to allow (perhaps optionally) the disabling of the allele reordering in the FUMA output? If that were allowed, then a user of FUMA would know whether they supplied input data with Effect == REF or Effect == ALT, and thus it would be possible to unambiguously determine the variants from the FUMA output. Otherwise, for the reasons I've stated, I'm afraid all indels become ambiguous and cannot be used.

Thanks
-Matt