IUPAC nucleotide codes


Hocum, Jonah Daniel

Feb 28, 2014, 6:20:00 PM
to gen...@soe.ucsc.edu


 

Hello,

 

Does anyone know how the web version and the standalone version of BLAT handle IUPAC nucleotide ambiguity codes?  For example, this query ‘AATAAAGTCTAAKTTAAAATCTGGAGCTGCCTTGGAGGAGAAAAGT’ contains a ‘K’ (G or T).  When aligned against hg19 with web BLAT, the highest-scoring alignment is on chromosome 3; with the standalone version, the highest-scoring alignment is on chromosome 9.  In the details page from the web version, the ‘K’ appears to be dropped:

 

cDNA YourSeq

 

AATAAAGTCt AATTAAAATC TGGAGCTGCC TTGGAGGAGA AAAGT


Genomic chr3 :

 

tcctgacttc aggtgatccg cccgcctcag gctcccaaag tgctgggatt  149377395

acaggcatga gccaccgcgc ccagcctgcc ttaatatttt tacagggtaa  149377445

AATAAAGTCg AAgTTAAAAT CTGGAGCTGC CTTGGAGGAG AAAAGTttaa  149377495

ggaaaagaca aggccactca tagttttgcc tcggaaaagg tagaattttg  149377545

gggccactcc ctgaatggct gcatccatat ccaaaacaga accacc

 

 

I am trying to figure it out myself, but if anyone knows the answer I would appreciate any help!

 

Jonah

Jonathan Casper

Mar 3, 2014, 4:45:44 PM
to Hocum, Jonah Daniel, gen...@soe.ucsc.edu

Hello Jonah,

Thank you for your question about using IUPAC nucleotide codes with the BLAT tools. I have passed along your question about the different treatments of "K", but it sounds like BLAT, along with the rest of the kent tools, does not support IUPAC codes. One of our engineers warns that using too many IUPAC codes may convince the web version of BLAT to treat your query as a protein sequence, unless you change the query type from "BLAT's guess" to "DNA".

My standalone BLAT searches for your sequence in the hg19 genome assembly found the same region of chromosome 3 as the web version - I wasn't able to find any hits on chromosome 9. Perhaps this is due to some additional parameters you have passed to the standalone version? More information on how to tune the command-line version of BLAT to produce similar results to the web version is available at http://genome.ucsc.edu/FAQ/FAQblat.html#blat5.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group



--


Jonah Hocum

Mar 5, 2014, 12:31:26 PM
to Jonathan Casper, gen...@soe.ucsc.edu
Thank you Jonathan Casper,

I realized I truncated my original query, so it makes sense that you did not get any hits on chromosome 9. Thank you for your answer about IUPAC code support.

There is still another related discrepancy between the web and stand-alone versions of BLAT that I don't think is related to the chosen parameters.  I ran the same query, replacing any letter that was not A, T, C, or G with N, with the web and stand-alone versions against hg19 chr3.2bit.  The parameters could affect the score, but shouldn't the %ID be identical?  The spans of the alignment for the query and the target sequence, including gaps, are identical.  Maybe I am just missing something obvious.  The web version gives me 97% ID and the stand-alone version, which looks correct to me, gives me 92%.

Here is the query:

aataaagtctaanttaaaatctggagctgccttggaggagaaaagtttaaggaaaagaca
aggccactcatagttttgcctcnnaaaaggtagaattttggggccnctccctgaatggct
gcatccatatccaaaacagaaccaccaaagtgagccacttcccctgttatctgtacttgn
aggtggctccaattccagactcctcatagactggaagaaattagggccatcttagactaa
ngcaggnatacacgtatcatcctttttttttttttttttttganatggantctcnctcta
ttgcccangatggaatgcantggcntgattncggctccccgcaacctctgcctccctggt
tcaagcaattatcctgcctcagcctcccgantanntgttattcngtgcancaagncctcc
gtccatangacnggcaaaangaaaggggaaactagcacangncactccttggaaagnana
atctttgcnagct
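
The masking step described above (replacing any letter that is not A, T, C, or G with N) can be sketched in a few lines of Python. This is only an illustration, not the script used for this query: the function name `mask_non_acgt` is hypothetical, and it assumes a bare DNA sequence string rather than a FASTA file with header lines.

```python
def mask_non_acgt(seq):
    """Replace every letter other than A, C, G, T with N, preserving case.

    Assumption: `seq` is a bare DNA sequence (no FASTA header line).
    Non-letter characters such as newlines are passed through unchanged.
    """
    out = []
    for ch in seq:
        if ch.isalpha() and ch.upper() not in "ACGT":
            # IUPAC ambiguity codes (K, R, Y, ...) and any other letter become N/n.
            out.append("N" if ch.isupper() else "n")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # First part of the query from the original post, which contains a 'K'.
    print(mask_non_acgt("AATAAAGTCTAAKTTAAAATCTGGAGCTGCC"))
```

Note that this treats U (RNA) like any other non-ACGT letter and masks it as well.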



Jonah



output-standalone.out

Matthew Speir

Mar 6, 2014, 12:21:34 PM
to Jonah Hocum, Jonathan Casper, gen...@soe.ucsc.edu
Hi Jonah,

Thank you for your question. Since standalone BLAT does not have a column for percent identity, I'm assuming that you are calculating the percent identity yourself. Taking the sequence you provided, the top hit shows there are 453 matches, 9 mismatches, and 28 Ns. The method web BLAT uses to calculate percent identity is described on the BLAT FAQ page, http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. Web BLAT takes insertions in the query sequence into account and excludes the Ns from the calculation altogether. If we follow the method described in that FAQ, we get the following: 100% - (9+3+2)/(453+9) = 97%. Alternatively, we can include these Ns as mismatches. If we do that, we get: 100% - (9+28)/(453+9+28) = 92%. This matches the percent identity that you reported. It basically comes down to whether you count the Ns as mismatches or as non-aligning parts. Because these Ns come from the query itself, they should probably be ignored, as they are not really true mismatches.
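
The two figures in this reply can be checked with a short Python sketch. It treats each quoted fraction as the share of "bad" positions and subtracts it from 100%. The counts (453 matches, 9 mismatches, 28 Ns, and the extra terms 3 and 2) are taken from the reply; the function name and the way the extra terms are lumped into a single `penalty` argument are assumptions made for illustration.

```python
def percent_identity(matches, mismatches, penalty=0):
    """Return 100 * (1 - (mismatches + penalty) / (matches + mismatches))."""
    aligned = matches + mismatches
    if aligned == 0:
        return 0.0
    return 100.0 * (1.0 - (mismatches + penalty) / aligned)


matches, mismatches, n_count = 453, 9, 28

# Ns excluded from the calculation; 3 + 2 are the extra terms quoted in the reply.
ns_excluded = percent_identity(matches, mismatches, penalty=3 + 2)

# Ns counted as mismatches, so they enter both numerator and denominator.
ns_as_mismatches = percent_identity(matches, mismatches + n_count)

print(f"Ns excluded:      {ns_excluded:.2f}%")       # 96.97%
print(f"Ns as mismatches: {ns_as_mismatches:.2f}%")  # 92.45%
```

Rounded to whole percentages these are the 97% and 92% discussed in the thread.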

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Jonah Hocum

Mar 6, 2014, 1:27:46 PM
to gen...@soe.ucsc.edu
That makes sense.  I was actually using the blast8 output format, which does calculate the %ID but counts the Ns as mismatches.  I am going to switch to the psl output format and calculate the %ID as described in the link provided.
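
For anyone taking the same route, here is a rough Python rendering of the percent-identity calculation from the FAQ linked above. It is a sketch written from the FAQ's C routine rather than official code, so please check it against the FAQ before relying on it. It assumes nucleotide alignments only, a headerless tab-separated psl line with the standard 21 columns, and the FAQ's mRNA-style scoring by default; the function name is made up for this example.

```python
import math


def psl_percent_identity(psl_line, is_mrna=True):
    """Percent identity for one nucleotide psl line, following the BLAT FAQ method.

    Assumption: standard psl column order (matches, misMatches, repMatches,
    nCount, qNumInsert, qBaseInsert, tNumInsert, tBaseInsert, strand, qName,
    qSize, qStart, qEnd, tName, tSize, tStart, tEnd, ...).
    """
    f = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_num_insert, t_num_insert = int(f[4]), int(f[6])
    q_start, q_end = int(f[11]), int(f[12])
    t_start, t_end = int(f[15]), int(f[16])

    q_ali_size = q_end - q_start
    t_ali_size = t_end - t_start
    if min(q_ali_size, t_ali_size) <= 0:
        return 100.0  # the penalty is defined as 0 in this degenerate case

    # Penalty for the query span being longer (or, for non-mRNA, different).
    size_dif = q_ali_size - t_ali_size
    if size_dif < 0:
        size_dif = 0 if is_mrna else -size_dif

    # Number of inserts; target inserts are not penalised for mRNA-style scoring.
    insert_factor = q_num_insert + (0 if is_mrna else t_num_insert)

    # nCount (column 4) is not part of the total, so Ns are ignored.
    total = matches + rep_matches + mismatches
    milli_bad = 0
    if total != 0:
        bad = mismatches + insert_factor + round(3 * math.log(1 + size_dif))
        milli_bad = int(1000 * bad / total)  # truncates like the C integer result
    return 100.0 - milli_bad / 10.0
```

Because nCount never enters the formula, Ns in the query do not lower the result, which is the difference from the blast8 figure.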

Thank you!

Jonah