Clarification on SNP.txt outputs "nearestgene"


chris....@c4xdiscovery.com

Jul 9, 2018, 9:43:42 AM
to FUMA GWAS users
Dear Kyoko,

Thank you for the previous answer to my question; it was most helpful.

I have been going through the text data files and noticed one or two outputs in SNP.txt where the "nearestGene" column contains a gene name of the form "GeneX:GeneX" or "GeneX:GeneY:GeneZ", with the corresponding distance column "dist" also showing multiple values, for instance 00:00.

e.g.

TLR9:TLR9 00:00  exonic

Would you be able to explain the meaning of this output?

I can provide more information if needed. Have you come across this before?

Kindest regards,

Chris


Kyoko Watanabe

Jul 9, 2018, 10:05:58 AM
to FUMA GWAS users
Hi Chris,

Yes, that was a bug in the older version of FUMA.
It happened when there were multiple transcripts that were assigned to different exon indices. It simply means that TLR9 is the nearest gene, with distance 0.
I forgot to take the unique genes in that column.
I'm sorry for the confusion.
Did it happen in a recent job? It should be fixed in the current version.
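
For output files that still contain the duplicated symbols, the clean-up amounts to taking the unique genes per row. A minimal sketch in Python, assuming the "nearestGene" and "dist" values are colon-joined strings as in the example above; the function name is hypothetical and this is an illustration, not FUMA's actual code:

```python
def unique_nearest_genes(nearest_gene, dist):
    """Collapse a colon-joined nearestGene/dist pair to one distance per gene."""
    genes = nearest_gene.split(":")
    dists = [int(d) for d in str(dist).split(":")]
    if len(dists) == 1:
        # A single distance value applies to every listed gene.
        dists = dists * len(genes)
    best = {}
    for gene, d in zip(genes, dists):
        # Keep the smallest distance seen for each gene symbol.
        best[gene] = min(d, best.get(gene, d))
    return best


print(unique_nearest_genes("TLR9:TLR9", "00:00"))
print(unique_nearest_genes("TLR9:TLR9:TWF2", "00:00:00"))
```

Under these assumptions "TLR9:TLR9" collapses to a single TLR9 entry with distance 0, while a genuine second gene such as TWF2 is kept.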

Best,
Kyoko

chris....@c4xdiscovery.com

Jul 9, 2018, 10:58:31 AM
to FUMA GWAS users
Hi Kyoko,

Yes, it happened in the recent version as of last week. I also have other outputs with
TLR9:TWF2

If I understand you correctly, this means that both TLR9 and TWF2 have a distance of 0 to that independent SNP, i.e. they are both valid hits?

JobID 17830: for this specific data set I used version 1.3.2 of the online web tool.

When prioritising genes by "nearestGene", should we disregard the TWF2 hit in this instance if it doesn't have an independent SNP of its own, i.e. there is no single output with nearest gene TWF2 and dist 0, and the text file only has TLR9:TWF2 hits?

I don't want to misinterpret what the data is saying.


Kindest regards,

Chris

Kyoko Watanabe

Jul 9, 2018, 11:00:04 AM
to FUMA GWAS users
Hi Chris,

Ah, yes.
In the old version, I only used ANNOVAR information to assign the nearest gene, but now I assign it based on physical distance.
In that case, if there are multiple genes, it means those genes are at the same distance or the SNP overlaps both genes.
You need to check whether TLR9 exists in the nearest gene column, rather than looking for an exact match.
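
As a rough illustration of that check, here is a small Python sketch. It assumes each row of snps.txt has been read into a dict with "uniqID" and "nearestGene" keys; the helper name is hypothetical, and the example rows are taken from this thread:

```python
def snps_with_gene(rows, gene):
    """Return uniqIDs of rows whose nearestGene field lists `gene`."""
    return [row["uniqID"] for row in rows
            if gene in row["nearestGene"].split(":")]


rows = [
    {"uniqID": "3:52254202:A:G", "nearestGene": "TLR9"},
    {"uniqID": "3:52255744:C:T", "nearestGene": "TLR9:TLR9"},
    {"uniqID": "3:52266806:C:G", "nearestGene": "TLR9:TWF2"},
    {"uniqID": "3:52321202:A:G", "nearestGene": "WDR82:GLYCTK"},
]
tlr9_hits = snps_with_gene(rows, "TLR9")
twf2_hits = snps_with_gene(rows, "TWF2")
print(tlr9_hits)
print(twf2_hits)
```

Splitting on ":" before testing membership avoids both problems: an exact match on "TLR9" would miss "TLR9:TWF2", and a plain substring test could match a different gene whose symbol merely starts with the same letters.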

Does this help?

Best,
Kyoko

chris....@c4xdiscovery.com

Jul 10, 2018, 10:22:39 AM
to FUMA GWAS users
Hi Kyoko,

I understand the assignment part and that it is based on physical distance, but I am confused about how the distance can be 0 for two genes that aren't super close.

I am also trying to figure out whether to class this section as a new protein product: https://www.ncbi.nlm.nih.gov/ieb/research/acembly/av.cgi?db=human&term=TWF2&submit=Go
Is the transcript reading through and forming a new protein?
Is that why we get these TLR9:TWF2 entries?

3:52252188:A:G  rs143880948  3  52252188  A  G  0.002982  NA  6  TLR9            2907      intergenic
3:52252996:C:G  rs74735459   3  52252996  C  G  0.002982  NA  6  TLR9            2099      intergenic
3:52254202:A:G  rs112906996  3  52254202  A  G  0.002982  NA  6  TLR9            893       downstream
3:52255744:C:T  rs5743845    3  52255744  C  T  0.002982  NA  6  TLR9:TLR9       00:00     exonic
3:52256805:C:T  rs35342983   3  52256805  C  T  0.002982  NA  6  TLR9:TLR9       00:00     exonic
3:52257183:C:T  rs35654187   3  52257183  C  T  0.002982  NA  6  TLR9:TLR9       00:00     exonic
3:52258770:C:T  rs55921270   3  52258770  C  T  0.002982  NA  6  TLR9:TLR9       00:00     intronic
3:52263622:A:G  rs147881772  3  52263622  A  G  0.002982  NA  6  TLR9:TLR9:TWF2  00:00:00  intronic
3:52266806:C:G  rs140991232  3  52266806  G  C  0.002982  NA  6  TLR9:TWF2       00:00     intronic
3:52268743:A:G  rs142152005  3  52268743  G  A  0.002982  NA  6  TLR9:TWF2       00:00     intronic
3:52269958:A:C  rs141823729  3  52269958  A  C  0.002982  NA  6  TLR9:TWF2       00:00     intronic
3:52270178:A:G  rs139728815  3  52270178  A  G  0.002982  NA  6  TLR9:TWF2       00:00     intronic
3:52271164:C:T  rs139354186  3  52271164  C  T  0.002982  NA  6  TLR9:TWF2       00:00     intronic
3:52271779:C:G  rs146527642  3  52271779  G  C  0.002982  NA  6  TLR9:TWF2       00:00     intronic
3:52272197:A:C  rs142044912  3  52272197  A  C  0.002982  NA  6  TLR9:TWF2       00:00     intronic


So from this readout I would say that TLR9 does have a distance of 0 to at least 4 SNPs, but for TWF2 I'm not sure now.

"You need to check if TLR9 exist in the column of nearest genes rather than the exact match."

I don't quite follow your train of thought. Are you saying that I need to check each SNP location against the corresponding genes for these double hits?

Sorry for so many questions.

Kindest regards,

Chris

chris....@c4xdiscovery.com

Jul 10, 2018, 10:22:59 AM
to FUMA GWAS users
Hi Chris,

Based on my Ensembl genes v92 (obtained from biomaRt using R), the position of TLR9 is 3:52255096-52273183 and TWF2 is 3:52262626-52273276.
So there is an overlap.
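
To make the overlap concrete, here is a minimal Python sketch of a distance-based nearest gene assignment using the two coordinate ranges above. The function names are hypothetical and the boundary convention is an assumption, so this is an illustration rather than FUMA's implementation:

```python
# Gene coordinates (GRCh37) as quoted above.
GENES = {"TLR9": (52255096, 52273183), "TWF2": (52262626, 52273276)}


def distance_to_gene(pos, start, end):
    """Distance in bp from a position to a gene interval; 0 if inside it."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def nearest_genes(pos, genes):
    """Return all genes tied for the smallest distance, plus that distance."""
    dists = {name: distance_to_gene(pos, s, e) for name, (s, e) in genes.items()}
    best = min(dists.values())
    return sorted(name for name, d in dists.items() if d == best), best


print(nearest_genes(52266806, GENES))  # SNP inside both genes
print(nearest_genes(52255744, GENES))  # SNP inside TLR9 only
print(nearest_genes(52252188, GENES))  # SNP upstream of both
```

A SNP at 52266806 lies inside both intervals, so both genes tie at distance 0, which matches the TLR9:TWF2 rows. For the SNP at 52252188 this sketch gives 2908 bp to TLR9, whereas the output pasted above lists 2907, so the exact boundary convention evidently differs by one base.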

I'm sorry, I don't really understand what the actual problem is here.
I thought you wanted to find which SNPs are located within TLR9?
If that is the case, you just need to look for SNPs with TLR9 in the nearest gene column.
Is that not what you want?

Best,
Kyoko

chris....@c4xdiscovery.com

Jul 10, 2018, 10:23:43 AM
to FUMA GWAS users
Hi Kyoko,

Thanks for the information. I have checked the Ensembl website to view the overlap for this particular case, TLR9 and TWF2.

I can see that the gene products TLR9-201 and TWF2-201 are separate; however, there is a read-through product between the two called AC097637.1-201.

The issue is that we don't know which gene product to assign the SNP to.

We have lead SNPs, which we have used to look at other associated SNPs in LD with them, using thresholds such as r^2 > 0.6.

This has given us a list of genes, to which we are assigning a confidence score reflecting how good a candidate each gene is to investigate. However, a few SNPs have been assigned to two or three genes at one position, e.g. TLR9:TWF2.
So, rather than looking at the position of each individual SNP and then manually deciding which gene is more associated with it, we first wanted to ask why this has happened, to see whether we have to duplicate the hits for these overlapping gene positions or ignore them, etc.

So far with this data set (this is not the full list of SNP2GENE outputs) we have 63 entries where duplicates have occurred, and I don't quite understand some of the gene products listed, for instance DNAH1:RP11-168J18.6. When I initially looked for RP11-168J18.6 it wasn't in Ensembl; however, I have managed to track it down as gene AC092045.1 (ENSG00000239557) via https://link.springer.com/content/pdf/10.1186/s12885-016-2669-3.pdf and PEACHi.

Should we assign the SNP to both gene products, since the calculated distance is the same for both? And is this duplication only occurring for these specific SNPs, or is there any overriding in the SNP2GENE distance (i.e. do I need to check the SNPs either side of these duplicated SNPs to make sure that these pseudogenes have the same number of associated SNPs)?


uniqID             rsID         chr  pos       non_effect_allele  effect_allele  MAF       gwasP  r2        IndSigSNP    GenomicLocus  nearestGene            dist  func
3:52255744:C:T     rs5743845    3    52255744  C                  T              0.002982  NA     1         rs143918452  6             TLR9:TLR9              0     exonic
3:52256805:C:T     rs35342983   3    52256805  C                  T              0.002982  NA     1         rs143918452  6             TLR9:TLR9              0     exonic
3:52257183:C:T     rs35654187   3    52257183  C                  T              0.002982  NA     1         rs143918452  6             TLR9:TLR9              0     exonic
3:52258770:C:T     rs55921270   3    52258770  C                  T              0.002982  NA     1         rs143918452  6             TLR9:TLR9              0     intronic
3:52263622:A:G     rs147881772  3    52263622  A                  G              0.002982  NA     1         rs143918452  6             TLR9:TLR9:TWF2         0     intronic
3:52266806:C:G     rs140991232  3    52266806  G                  C              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52268743:A:G     rs142152005  3    52268743  G                  A              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52269958:A:C     rs141823729  3    52269958  A                  C              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52270178:A:G     rs139728815  3    52270178  A                  G              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52271164:C:T     rs139354186  3    52271164  C                  T              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52271779:C:G     rs146527642  3    52271779  G                  C              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52272197:A:C     rs142044912  3    52272197  A                  C              0.002982  NA     1         rs143918452  6             TLR9:TWF2              0     intronic
3:52321202:A:G     rs144890460  3    52321202  G                  A              0.002982  NA     1         rs143918452  6             WDR82:GLYCTK           0     exonic
3:52324258:A:G     rs146239975  3    52324258  G                  A              0.002982  NA     1         rs143918452  6             GLYCTK:GLYCTK-AS1      0     ncRNA_intronic
3:52407668:A:C     rs140481308  3    52407668  A                  C              0.002982  NA     1         rs143918452  6             DNAH1:RP11-168J18.6    0     ncRNA_exonic
3:52558480:A:G     rs181171122  3    52558480  G                  A              0.002982  NA     1         rs143918452  6             STAB1:NT5DC2           0     UTR3
3:52580652:C:G     rs75386783   3    52580652  G                  C              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     UTR3
3:52583643:A:G     rs143185796  3    52583643  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52590175:A:T     rs191039949  3    52590175  T                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52591264:A:G     rs151215721  3    52591264  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52592101:A:G     rs141985360  3    52592101  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52592908:A:AG    rs561733242  3    52592908  AG                 A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52596064:A:G     rs189133265  3    52596064  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52597062:G:GAT   rs566123391  3    52597062  GAT                G              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52597757:A:G     rs145283174  3    52597757  A                  G              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52603522:A:G     rs140655062  3    52603522  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52605915:A:G     rs187137760  3    52605915  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52610790:A:C     rs139817169  3    52610790  A                  C              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52612018:A:G     rs145865479  3    52612018  G                  A              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52612192:C:T     rs149018136  3    52612192  C                  T              0.002982  NA     1         rs143918452  6             SMIM4:PBRM1            0     intronic
3:52716528:A:C     rs145308371  3    52716528  C                  A              0.002982  NA     1         rs143918452  6             PBRM1:GNL3             0     intronic
3:52719395:A:AACC  rs552930678  3    52719395  AACC               A              0.002982  NA     1         rs143918452  6             PBRM1:GNL3             0     intronic
3:52847937:C:G     rs13322530   3    52847937  G                  C              0.002982  NA     1         rs143918452  6             ITIH4:RP5-966M1.6      0     intronic
3:52847984:A:G     rs2276811    3    52847984  A                  G              0.002982  NA     1         rs143918452  6             ITIH4:RP5-966M1.6      0     intronic
3:52848037:C:T     rs151083454  3    52848037  C                  T              0.002982  NA     1         rs143918452  6             ITIH4:RP5-966M1.6      0     exonic
3:52848533:A:T     rs3774357    3    52848533  T                  A              0.002982  NA     1         rs143918452  6             ITIH4:RP5-966M1.6      0     intronic
5:60189026:C:G     rs929780     5    60189026  G                  C              0.1123    NA     1         rs2694528    8             ERCC8:GNL3LP1          0     ncRNA_exonic
8:22447426:C:CGGTCCGGGAG  rs3830696    8    22447426  CGGTCCGGGAG        C              0.325     NA     0.778884  rs2280104    11            PDLIM2:AC037459.4      0     intronic
8:22447647:C:CTG   rs5890055    8    22447647  CTG                C              0.3241    NA     0.775793  rs2280104    11            PDLIM2:AC037459.4      0     intronic
8:22449484:C:T     rs4592028    8    22449484  T                  C              0.335     NA     0.813128  rs2280104    11            PDLIM2:AC037459.4      0     intronic
8:22449551:C:CATTTTTCTT  rs11282127   8    22449551  C                  CATTTTTCTT     0.3608    NA     0.709214  rs2280104    11            PDLIM2:AC037459.4      0     intronic
8:22451116:A:T     rs13258100   8    22451116  A                  T              0.338     NA     0.775613  rs2280104    11            PDLIM2:AC037459.4      0     intronic
8:22451688:C:G     rs3064       8    22451688  G                  C              0.341     NA     0.794419  rs2280104    11            PDLIM2:AC037459.4      0     UTR3
8:22452357:G:T     rs11782130   8    22452357  G                  T              0.3419    NA     0.79068   rs2280104    11            PDLIM2:AC037459.4      0     UTR3
8:22452704:A:G     rs3735893    8    22452704  A                  G              0.3419    NA     0.79068   rs2280104    11            PDLIM2:AC037459.4      0     UTR3
8:22453223:A:C     rs11785755   8    22453223  C                  A              0.338     NA     0.783289  rs2280104    11            PDLIM2:AC037459.4      0     UTR3
8:22453426:A:G     rs11783129   8    22453426  G                  A              0.338     NA     0.783289  rs2280104    11            PDLIM2:AC037459.4      0     UTR3
8:22454826:A:G     rs3735894    8    22454826  A                  G              0.341     NA     0.802136  rs2280104    11            PDLIM2:AC037459.4      0     UTR3
8:22457205:G:T     rs755934     8    22457205  T                  G              0.335     NA     0.828706  rs2280104    11            AC037459.4:C8orf58     0     UTR5
8:22457206:C:T     rs755935     8    22457206  T                  C              0.335     NA     0.828706  rs2280104    11            AC037459.4:C8orf58     0     UTR5
8:22457388:A:G     rs2272718    8    22457388  G                  A              0.332     NA     0.817831  rs2280104    11            AC037459.4:C8orf58     0     intronic
8:22457804:C:T     rs746011     8    22457804  C                  T              0.332     NA     0.817831  rs2280104    11            AC037459.4:C8orf58     0     intronic
8:22471824:A:G     rs3736147    8    22471824  G                  A              0.335     NA     0.829697  rs2280104    11            CCAR2:RP11-582J16.5    0     exonic
8:22472255:C:CTGCTGCCTTCATCCTGATGGGT  rs71299322   8    22472255  C                  CTGCTGCCTTCATCCTGATGGGT  0.3688    NA     0.723412  rs2280104    11            CCAR2:RP11-582J16.5    0     ncRNA_exonic
8:22473158:C:T     rs11781149   8    22473158  C                  T              0.334     NA     0.826412  rs2280104    11            CCAR2:RP11-582J16.5    0     ncRNA_exonic
8:22473465:C:T     rs7843128    8    22473465  T                  C              0.3678    NA     0.72012   rs2280104    11            CCAR2:RP11-582J16.5    0     ncRNA_exonic
8:22473850:C:T     rs6558167    8    22473850  T                  C              0.3668    NA     0.723889  rs2280104    11            CCAR2:RP11-582J16.5    0     ncRNA_exonic
14:88419003:A:T    rs2301122    14   88419003  A                  T              0.3439    NA     0.715778  rs8005172    14            GALC:FAM35CP           0     ncRNA_exonic
14:88419374:C:T    rs2236362    14   88419374  T                  C              0.3439    NA     0.715778  rs8005172    14            GALC:FAM35CP           0     ncRNA_exonic
17:40646803:A:G    rs11651671   17   40646803  G                  A              0.2575    NA     0.768936  rs601999     17            ATP6V0A1:MIR548AT      0     ncRNA_exonic
17:40702252:C:T    rs62075836   17   40702252  C                  T              0.2584    NA     0.790437  rs601999     17            RP11-400F19.8:HSD17B1  0     ncRNA_intronic
17:40705955:C:T    rs2676530    17   40705955  C                  T              0.2584    NA     0.790437  rs601999     17            RP11-400F19.8:HSD17B1:RP11-400F19.6  0     ncRNA_exonic


I hope that explains it in more detail.

Please let me know if anything specific is confusing.

Kindest regards,

Chris

chris....@c4xdiscovery.com

Jul 10, 2018, 10:24:36 AM
to FUMA GWAS users
Hi Kyoko,

Following up on this, I have noticed that gene.txt somehow eliminates this problem.

For instance:

ensg           ENSG00000239732  ENSG00000173366  ENSG00000247596
symbol         TLR9             TLR9             TWF2
chr            3                3                3
start          52255096         52255097         52262626
end            52273183         52265206         52273276
strand         -1               -1               -1
type           protein_coding   protein_coding   protein_coding
entrezID       54106            54106            11344
HUGO           TLR9             NA               TWF2
pLI            0.000525         NA               0.85781
ncRVIS         -0.29897         -0.29897         NA
posMapSNPs     52               52               48
posMapMaxCADD  28.6             28.6             28.6
eqtlMapSNPs    0                0                18
eqtlMapminP    NA               NA               2.14E-05
eqtlMapminQ    NA               NA               7.47E-07
eqtlMapts      NA               NA               GTEx_v6_Cells_Transformed_fibroblasts:GTEx_v7_Cells_Transformed_fibroblasts
eqtlDirection  NA               NA               NA
minGwasP       NA               NA               NA
IndSigSNPs     rs143918452      rs143918452      rs143918452
GenomicLocus   6                6                6

The only issue is the duplication of genes because of two different start and end positions.

However, after solving one issue, another comes up: the numbers of SNPs per gene don't correspond between the SNP.txt file and the Gene.txt file. The Gene.txt file has more SNPs under posMapSNPs (posMap) than the SNP.txt file. I don't follow how this can be, as SNP.txt contains all SNPs in the LD block while Gene.txt counts all SNPs for that parameter, so should they not be the same?


Kindest regards,

Chris

Kyoko Watanabe

Jul 10, 2018, 4:35:50 PM
to FUMA GWAS users

Hi Chris,

First of all, please be aware that all the info from the current FUMA is based on hg19 (GRCh37), so if you look things up in Ensembl please make sure that you are looking at the correct genome assembly.
I am aware that there are notable differences between GRCh37 and GRCh38 (even some ENSG IDs differ for the same gene).
Since FUMA only supports GRCh37, I also use Ensembl v92 GRCh37.
See the picture below, this is how it looks for GRCh37.

Positional mapping is based either on the user-selected functional consequences or on the maximum distance. The nearest gene column in the snps.txt file is independent of positional mapping.
For example, if SNP A is located within gene B but is also less than 10 kb away from gene C (the default parameter), SNP A is mapped to gene C as well, but the nearest gene column only contains gene B since that is the nearest one. Does this make sense?
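
A small Python sketch of the difference, using the TLR9 and TWF2 coordinates from earlier in this thread. The function names are hypothetical, the 10 kb window is the default mentioned above, and this only illustrates distance-based mapping, not FUMA's own code:

```python
# Gene coordinates (GRCh37) quoted earlier in the thread.
GENES = {"TLR9": (52255096, 52273183), "TWF2": (52262626, 52273276)}


def distance_to_gene(pos, start, end):
    """Distance in bp from a position to a gene interval; 0 if inside it."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def positional_map(pos, genes, max_dist=10_000):
    """All genes within max_dist bp of the SNP (distance-based mapping)."""
    return sorted(name for name, (s, e) in genes.items()
                  if distance_to_gene(pos, s, e) <= max_dist)


# rs55921270 at 3:52258770 lies inside TLR9 and 3856 bp before the TWF2 start.
mapped = positional_map(52258770, GENES)
print(mapped)
```

With these inputs the SNP is positionally mapped to both TLR9 and TWF2, while its nearest gene column lists only TLR9, so per-gene SNP counts in the gene table can be larger than what the nearest gene column alone suggests.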

Best,
Kyoko


chris....@c4xdiscovery.com

Jul 12, 2018, 6:00:47 AM
to FUMA GWAS users
Hi Kyoko,

Thanks again for your help. So it is the minimum and maximum positions, from the TSS to the TES, that have been taken for the nearest gene (that's why there is overlap here).

For gene.txt, then, can independent SNPs be assigned to multiple genes according to our parameters?

Kindest regards,

Chris

chris....@c4xdiscovery.com

Jul 12, 2018, 6:00:59 AM
to FUMA GWAS users
Hi Chris,

Yes, a SNP can be mapped to multiple genes, since positional mapping is based purely on physical distance.

Best,
Kyoko

chris....@c4xdiscovery.com

Jul 12, 2018, 6:35:53 AM
to FUMA GWAS users
Hi Kyoko,

Sorry, I misread the reply initially.

So my question is: is there any way of knowing which SNPs are the ones counted in posMapSNPs or eqtlMapSNPs for each gene? Or of generating a map with all the positional data for the SNPs?

We input only the significant SNPs that we wish to investigate, with an arbitrarily high p-value.

Kindest regards,

Chris

Kyoko Watanabe

Jul 13, 2018, 1:12:40 PM
to FUMA GWAS users
Hi Chris,

Yes, FUMA currently does not output which SNPs are mapped to which genes.
It's relatively easy for eQTL mapping, since the eqtl.txt file contains both the uniqID and the gene ID, so you just need to select SNPs with 1 in the column "eqtlMapFilt" in the snps.txt file.
For positional mapping, you need to re-assign SNPs based on the distance.
I've had a similar question before, and I have added an item to my to-do list to create one more output table indicating which SNPs are mapped to which gene by which mapping method.
Hopefully I can update that soon.
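
In the meantime, the eQTL part can be sketched in a few lines of Python. This assumes both files are tab-delimited with a header row, that "eqtlMapFilt" holds 1 for SNPs kept by eQTL mapping, and that the gene ID column in eqtl.txt is called "gene" (that column name is an assumption here, so adjust it to your file). The rows below are toy values for illustration, not real FUMA output:

```python
import csv
import io
from collections import defaultdict


def eqtl_snps_per_gene(snps_rows, eqtl_rows, gene_col="gene"):
    """Group uniqIDs by gene for SNPs flagged with eqtlMapFilt == 1."""
    kept = {r["uniqID"] for r in snps_rows if r["eqtlMapFilt"] == "1"}
    by_gene = defaultdict(set)
    for r in eqtl_rows:
        if r["uniqID"] in kept:
            by_gene[r[gene_col]].add(r["uniqID"])
    return {gene: sorted(ids) for gene, ids in by_gene.items()}


# Toy stand-ins for snps.txt and eqtl.txt (values invented for illustration).
snps_txt = "uniqID\teqtlMapFilt\n3:52266806:C:G\t1\n3:52268743:A:G\t0\n"
eqtl_txt = ("uniqID\tgene\n"
            "3:52266806:C:G\tENSG00000247596\n"
            "3:52268743:A:G\tENSG00000247596\n")
snps_rows = list(csv.DictReader(io.StringIO(snps_txt), delimiter="\t"))
eqtl_rows = list(csv.DictReader(io.StringIO(eqtl_txt), delimiter="\t"))
result = eqtl_snps_per_gene(snps_rows, eqtl_rows)
print(result)
```

For real data, replace the `io.StringIO` objects with the opened snps.txt and eqtl.txt files. The positional part needs gene coordinates and the distance window instead, since there is no per-SNP gene column to join on.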

Best,
Kyoko