[Genome] aligning short sequences with BLAT

Robert Hunter

Jan 28, 2008, 12:57:38 PM

to gen...@soe.ucsc.edu

Hello. I'm having difficulty using standalone BLAT (version 3.4) to
align short DNA sequences. My targets are 30-mers and the queries are
11-mers. Matches should be anything over 80% identity. I followed the
guidelines in the BLAT FAQ for "Using Blat for short sequences with
maximum sensitivity"; however, I am still unable to produce matches,
even with exact queries.

E.g.,

I have database.fa with a single entry:

>TEST1|offset|123|
TACTGGATTCCGAGACCACACGCGTCGTAG

...and a query.fa with a single entry:

>TEST2|offset|456|
CCGAGACCACA

Using the following command, I expect to get a match in output.psl;
however, BLAT returns no results (an empty output file).

blat -t=dna -q=dna -fine -tileSize=6 -stepSize=3 -minMatch=1 \
  -repMatch=1000000 -noHead -out=psl database.fa query.fa output.psl

According to the FAQ, a guaranteed match should occur when the query size is at least:
2 * stepSize + tileSize - 1
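
As a quick check of the arithmetic, the FAQ formula quoted above can be evaluated for the settings in the command. This is only a sketch; the function name is made up for illustration and is not part of BLAT.

```python
def min_guaranteed_query_size(tile_size: int, step_size: int) -> int:
    """Smallest query length for which the FAQ formula promises a match."""
    return 2 * step_size + tile_size - 1


# Settings from the command above: -tileSize=6 -stepSize=3
print(min_guaranteed_query_size(tile_size=6, step_size=3))  # 11
```

With these settings the 11-mer query sits exactly at the stated limit, so there is no margin left for mismatches.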

Am I doing something wrong? Any suggestions would be greatly appreciated.

--
RH

Galt Barber

Jan 28, 2008, 6:38:22 PM

to Robert Hunter, gen...@soe.ucsc.edu

Hi Robert,

I got this to work:

blat -t=dna -q=dna -tileSize=6 -stepSize=3 -minMatch=1 -repMatch=1000000 \
  -minScore=0 -minIdentity=80 -noHead -out=psl database.fa query.fa \
  output.psl

Basically it was necessary to remove -fine and
to add -minScore=0. (minScore seems to need to be 11 or less).
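
One plausible reading of the minScore observation: the BLAT usage message describes the score as matches minus mismatches minus a gap penalty, with a default -minScore of 30. An 11-base query can then never score above 11. The sketch below uses a simplified score that ignores gaps and assumes the cutoff is inclusive; neither is taken from the BLAT source.

```python
DEFAULT_MIN_SCORE = 30  # default per the BLAT usage message (assumed for this version)


def simple_score(matches: int, mismatches: int) -> int:
    """Simplified score: matches minus mismatches, gap penalties ignored."""
    return matches - mismatches


def passes(matches: int, mismatches: int, min_score: int) -> bool:
    """True if the alignment meets the cutoff (assumed inclusive)."""
    return simple_score(matches, mismatches) >= min_score


# A perfect 11-mer alignment scores 11.
print(passes(11, 0, DEFAULT_MIN_SCORE))  # False
print(passes(11, 0, 11))                 # True
# An 11-mer with 2 mismatches (about 82% identity) scores 9 - 2 = 7.
print(simple_score(9, 2))                # 7
```

Under this simplified model, 80% identity hits on 11-mers would need a cutoff of 7 or lower, which -minScore=0 covers.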

Some users report that -fine finds extra alignments,
but it may also miss others.

In any case, Jim Kent, the author of BLAT, is working
on a new short exact-match program which might be helpful
if/when it's released.

What you are doing here is stretching it to its absolute limits
of sensitivity, plus you are using zillions of tiny targets.

If both of these sets are from a genome,
another approach might be to blat the sets
to the genome, and then see where their
alignment coordinates overlap to create
a match between the sets.
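
The coordinate-overlap idea can be sketched as follows. It assumes each set has already been aligned to the genome and reduced to (name, chromosome, start, end) records with half-open coordinates; the Hit type, the function name, and the coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlapping_pairs(targets, queries):
    """Return (target name, query name) pairs whose genome hits overlap."""
    by_chrom = defaultdict(list)
    for t in targets:
        by_chrom[t.chrom].append(t)
    pairs = []
    for q in queries:
        for t in by_chrom.get(q.chrom, []):
            # Half-open intervals overlap when each starts before the other ends.
            if q.start < t.end and t.start < q.end:
                pairs.append((t.name, q.name))
    return pairs


# Made-up coordinates for illustration only.
targets = [Hit("TEST1", "chr1", 100, 130)]
queries = [Hit("TEST2", "chr1", 109, 120), Hit("TEST3", "chr2", 109, 120)]
print(overlapping_pairs(targets, queries))  # [('TEST1', 'TEST2')]
```

The nested loop is quadratic per chromosome; for very large sets, sorting the hits and sweeping, or an interval tree, would scale better.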

-Galt


Robert Hunter

Jan 29, 2008, 10:42:48 AM

to Galt Barber, gen...@soe.ucsc.edu

On Jan 28, 2008 4:38 PM, Galt Barber <ga...@soe.ucsc.edu> wrote:
> What you are doing here is stretching it to its absolute limits
> of sensitivity, plus you are using zillions of tiny targets.

Galt, thank you for your reply. I was also able to produce a match
with the options you suggested. Unfortunately, it only works for exact
matches (I was hoping for an 80% identity threshold). I wondered
whether decreasing tileSize would yield further sensitivity, but this
value is apparently already at the minimum allowed by the software. I
even tried modifying the source code to allow lower values; however, I
do not fully understand how this parameter fits into the larger scope
of the program, so I may not have made all the necessary changes
(i.e., I was unable to produce 80% matches even with a tileSize of 3).
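
A counting argument suggests why seeding on exact tiles struggles here. At 80% identity an 11-mer may carry 2 mismatches, and 2 mismatches can split the 9 matching bases into three runs of 3, so no exact run longer than 3 is guaranteed. The sketch below only does this arithmetic; the function names are made up, and it does not model how BLAT places tiles.

```python
def max_mismatches(query_len: int, min_identity_pct: int) -> int:
    """Largest number of mismatches that still meets the identity threshold."""
    return query_len * (100 - min_identity_pct) // 100


def guaranteed_exact_run(query_len: int, mismatches: int) -> int:
    """Longest exact run guaranteed by pigeonhole: the matching bases are
    split into at most mismatches + 1 runs, so one run has at least
    ceil((query_len - mismatches) / (mismatches + 1)) bases."""
    matched = query_len - mismatches
    return -(-matched // (mismatches + 1))  # ceiling division


m = max_mismatches(11, 80)
print(m)                            # 2
print(guaranteed_exact_run(11, m))  # 3
```

A 6-base exact tile is therefore not guaranteed to exist in an 80% identity 11-mer alignment, regardless of the other settings.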

At this point, I am starting to look into other methods of producing
very large numbers of alignments between very small targets and even
smaller queries. There is much literature on the subject. Would you
have any suggestions of where to start?

Sincerely,

--
RH

Hiram Clawson

Jan 29, 2008, 12:25:05 PM

to Robert Hunter, gen...@soe.ucsc.edu

Good Morning Robert:

You are working with such small sequences that you might want to
do a statistical analysis instead of trying to find matches.

For example, working with sequences of length 11 implies that
for any given 11-mer, it would by chance be found 715 times
in a random genome of length 3 billion. (3 billion / 4^11)
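
The estimate above can be reproduced in a few lines. The sketch assumes a single strand, uniformly random bases, and ignores the small end effect of the window length; the function name is illustrative.

```python
def expected_occurrences(genome_length: int, k: int) -> float:
    """Expected count of one specific k-mer in a uniformly random sequence."""
    return genome_length / 4 ** k


print(round(expected_occurrences(3_000_000_000, 11)))  # 715
```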

Of course 11-mers are not random in a genome, thus you might want
to characterize all possible 11-mers in a genome by counting
up a histogram of them all, then compare that histogram with
your query 11-mers to see how common they would be in the genome.
If you are considering mismatches, that makes your query sequences
even more common in your target sequence.
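
A minimal sketch of the histogram idea, using the 30-mer from earlier in the thread in place of a genome. The function name is made up; for a whole genome the same approach holds up to 4^11 (about 4.2 million) distinct keys, so a more compact representation may be preferable.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer on the given strand of the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


target = "TACTGGATTCCGAGACCACACGCGTCGTAG"  # TEST1 from the thread
histogram = count_kmers(target, 11)

print(sum(histogram.values()))   # 20 windows in a 30-mer
print(histogram["CCGAGACCACA"])  # 1
```

Looking up each query 11-mer in such a table shows how often it occurs in the target.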

There is a simple search tool in the kent source tree:
findMotif
that can find exact matches for sequences of length 4 to 16 bases.
It is a simple moving window of the query sequence over the target.
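
The moving-window idea can be sketched as below. This is not findMotif's code; it is an illustrative Python version with an optional mismatch allowance added, since the thread asks about 80% identity (2 mismatches in an 11-mer). The function and parameter names are hypothetical.

```python
def find_matches(target: str, query: str, max_mismatches: int = 0):
    """Slide the query along the target and report (start, mismatches)
    for every window with at most max_mismatches differing bases.
    Starts are 0-based; only the given strand is searched."""
    hits = []
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        mismatches = sum(1 for a, b in zip(window, query) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


target = "TACTGGATTCCGAGACCACACGCGTCGTAG"  # TEST1 from the thread
print(find_matches(target, "CCGAGACCACA"))  # [(9, 0)]

# The same query with two bases changed, allowing up to 2 mismatches.
approx_hits = find_matches(target, "CCGTGACCTCA", max_mismatches=2)
```

The cost is proportional to target length times query length per pair, which is fine for 30-mer targets but adds up over very large numbers of pairs.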

--Hiram
