BLAST query analysis

Jeswin

Mar 14, 2012, 2:55:04 PM
to diy...@googlegroups.com
Hi all,
To prepare probes and primers, I have to use BLAST to determine the
best choices for the sequence of interest. I need to count how many
times each distinct hit sequence occurs relative to my query. The
result is in the format:

Query 1 AACGGCCAGGTCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGCTGGGGCTTGGTCA 60
U95551 1162
............................................................ 1221
AJ627221 1160
............................................................ 1219

I need to know how many of each type there is compared to my query.
The above is just an example. The way the boss analyzes is by copying
to MS Word, removing the sequence identifiers (e.g. U95551) and the
numbers. Then he puts it into Excel and sorts it out. Then he
determines how many of each hit there is. In the end, he has a list of
each of the hits and their frequency.
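
The manual Word/Excel tally described above can be sketched in a few lines of Python using only the standard library. This is a sketch under assumptions, not a tested tool: it assumes each aligned row is a whitespace-separated token of at least 10 characters made only of bases, dots (identical to the query) and dashes (gaps), and that the query line starts with "Query". The function name count_hit_rows and the accession X00001 in the sample are made up for illustration.

```python
import re
from collections import Counter

# Assumption: an alignment row is a token made only of bases, dots
# (identity to the query) or dashes (gaps), at least 10 characters long.
ROW = re.compile(r"^[ACGTN.\-]{10,}$", re.IGNORECASE)


def count_hit_rows(blast_text):
    """Count how often each distinct aligned row occurs, skipping the query."""
    counts = Counter()
    for line in blast_text.splitlines():
        if line.lstrip().startswith("Query"):
            continue  # the query row itself is not a hit
        for token in line.split():
            # Identifiers and coordinates contain digits, so they never match.
            if ROW.match(token):
                counts[token.upper()] += 1
    return counts


# Shortened sample; X00001 is a made-up identifier.
example = """\
Query 1 AACGGCCAGG 10
U95551 1162
.......... 1171
AJ627221 1160
..T....... 1169
X00001 5
.......... 14
"""

for row, n in count_hit_rows(example).most_common():
    print(n, row)
```

Because the parser works on tokens rather than whole lines, it should not matter whether the identifier and the dotted row sit on the same line or were wrapped onto separate lines by the mail client.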

I'm looking for a faster way than his approach. I am looking at the
BioPerl modules to see if anything matches. Is there something already
available for this purpose?

Thanks

Nathan McCorkle

Mar 14, 2012, 3:04:20 PM
to diy...@googlegroups.com
Biopython does BLAST easily... it's pretty well documented too.

Are you blasting against local genome data or using the web, e.g. NCBI?

--
Nathan McCorkle
Rochester Institute of Technology
College of Science, Biotechnology/Bioinformatics

Dakota Hamill

Mar 14, 2012, 4:05:24 PM
to diy...@googlegroups.com
I recently read a summary review of 16S primers and their use in classifying new microbial species, and it compared BLAST with two other big databases. The authors stated that BLAST searches don't use a peer-reviewed database, so anyone can submit sequences and label them whatever they want.

Their point was that, after producing a bunch of 16S gene fragments from their personal store of samples (known strains), having them sequenced, and then BLASTing them, I think BLAST showed errors more than 20% of the time, whereas the other two databases, which were peer-checked, agreed with each other either 99% or 100% of the time.

Anyway, I know it doesn't answer your question, but maybe someone else knows of other databases that are reviewed. That might end up helping you in the long run, because you would be checking against better-quality data.

I can't find the article because it was in paper form somewhere around my room. If I find it, I'll post it.

Jeswin

Mar 14, 2012, 6:00:00 PM
to diy...@googlegroups.com
On Wed, Mar 14, 2012 at 3:04 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> biopython does blast easily... its pretty well documented too
>
> Are you blasting against local genome data or using web... e.g. NCBI?
>
NCBI.

I don't know any Python. I'm really undisciplined sometimes, so learning
new languages takes me a while.

I might also add that it would be nice to get the program into
something that the boss can use, like a web app or something, since he
doesn't use the command line.

Nathan McCorkle

Mar 15, 2012, 8:14:28 AM
to diy...@googlegroups.com
Do you know Perl? IMO Perl is much clunkier than Python (which is easy
to use, and IMO is the best get-shit-done language).

It's also pretty easy to build a web app using Python and Django (which
is also very useful, and not too hard).

Cathal Garvey

Mar 15, 2012, 8:30:04 AM
to diy...@googlegroups.com
I love Python, but still have a lingering bitterness that lots of the good stuff, including Biopython, is Python 2 only.

Jeswin

Mar 15, 2012, 8:53:47 AM
to diy...@googlegroups.com
On Thu, Mar 15, 2012 at 8:14 AM, Nathan McCorkle <nmz...@gmail.com> wrote:
> Do you know Perl? IMO Perl is much clunkier than Python (which is easy
> to use, and IMO is the best get-shit-done language)
>
I got the basics of Perl down and will have to review since it's been a while.

I woke up today with some pseudocode in my head:
[1] sort lines
[2] store the query line in a list
[3] compare each next line with the previous line
[3A] if they are the same, add to a counter and discard that line
[3B] if they are different, store the next line in the list and loop
back to [3]

Not sure how fast or easy it is to implement. Will have to work on it some more.
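
The sort-then-count-runs idea in the pseudocode above maps onto sorted() plus itertools.groupby in Python. A minimal sketch, assuming the identifiers and coordinates have already been stripped so that each row is just the dotted alignment string; tally_sorted is a hypothetical name and the sample rows are invented.

```python
from itertools import groupby


def tally_sorted(rows):
    """Sort the rows, then count each run of identical neighbours."""
    tallies = [(row, sum(1 for _ in run)) for row, run in groupby(sorted(rows))]
    # Most frequent first; ties keep their sorted order because sort is stable.
    return sorted(tallies, key=lambda pair: pair[1], reverse=True)


# Invented rows: '.' means "same base as the query at this position".
rows = ["..T.......", "..........", "..T.......", "....A....."]
for row, count in tally_sorted(rows):
    print(count, row)
```

Sorting dominates the cost, so even several thousand hits should take well under a second. collections.Counter would give the same counts without sorting, but this version follows the steps as written.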

> Its also pretty easy to build a web-app using python and Django (which
> is also very useful, and not too hard)
>

Would be nice to learn.

J Adams

Mar 15, 2012, 11:33:01 AM
to diy...@googlegroups.com
Two things:

1) Perl, Python, all of the languages are very similar. If you know C, you
can figure these languages out pretty quickly. There are lots of free books
online that give you the command set. The commands and syntax
of these two are almost the same. I prefer to use Java/PHP myself with
Perl or Python.

2) Especially with web-based apps, always start by doing a Google search for
the type of application you want to build, there are lots of already written
apps on the internet that people are freely sharing that you simply download
and modify to meet your specific needs. Don't reinvent the wheel. My rule
of thumb is use what's there and if I can contribute something...put it out
there for others... Here is a sample of using Perl to BLAST...

http://www.bios.niu.edu/johns/bioinform/perl_blast_parser.htm

If you've never done this before, though, one thing to become knowledgeable
about is "server-side" versus "client-side" programming when building web apps.
When I started writing web apps, this was the only thing I found
confusing. This concept will come up because you will probably BLAST
sequences, process and store them in a database on a server, then do further
processing and display on a client.

My input...

J Adams
Azco Biotech, Inc.
3626 Ocean Ranch Blvd.
Oceanside, CA 92056
e. jad...@azcobiotech.com
t. 858-259-9528
m. 858-525-2770

ruphos

Mar 15, 2012, 12:41:28 PM
to diy...@googlegroups.com
On Thu, Mar 15, 2012 at 5:14 AM, Nathan McCorkle <nmz...@gmail.com> wrote:
> Do you know Perl? IMO Perl is much clunkier than Python (which is easy
> to use, and IMO is the best get-shit-done language)

Python vs. Perl flame war initiated in 3... 2... 1...

Pretty much anything you can do in one, you can do in the other. Installing and setting up a whole framework is probably a little overkill. FastCGI and mod_perl (and probably mod_python) are fairly standard in Apache installs, so it should be fairly feasible to set up a simple HTML form and have the scripting language of your choice do everything behind the scenes.
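
As a rough illustration of the "simple HTML form with a script behind it" idea, here is a minimal sketch that swaps Apache and mod_perl for Python's built-in wsgiref server, so nothing needs installing. It assumes the boss pastes already-stripped alignment rows, one per line, into a text box; the names app, serve and FORM, the form field name "rows" and port 8000 are all arbitrary choices for the example.

```python
from collections import Counter
from html import escape
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

FORM = ('<form method="post"><textarea name="rows" rows="20" cols="80">'
        '</textarea><br><input type="submit" value="Count"></form>')


def app(environ, start_response):
    """Show a paste box; on POST, tally identical lines and list them."""
    body = FORM
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        fields = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        lines = [ln.strip() for ln in fields.get("rows", [""])[0].splitlines()]
        counts = Counter(ln for ln in lines if ln)  # ignore blank lines
        listing = "\n".join(f"{n}\t{escape(row)}"
                            for row, n in counts.most_common())
        body += "<pre>" + listing + "</pre>"
    payload = body.encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                              ("Content-Length", str(len(payload)))])
    return [payload]


def serve(port=8000):
    """Run the app on localhost only; the port number is arbitrary."""
    make_server("127.0.0.1", port, app).serve_forever()
```

Calling serve() and opening http://127.0.0.1:8000/ in a browser should show the paste box. The same app function could later be moved behind Apache or another WSGI host without changes.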

I'm a Perl advocate personally (line noise is sexy and whitespace is for suckers), but BioPerl always ended up seeming too clunky. I've found it easier to use the NCBI API documentation and just roll it myself. There are also plenty of examples of such things floating around the internets, as was mentioned.


--
"And if ye cannot be saints of knowledge, then be at least its warriors."
-- Friedrich Nietzsche

Bryan Bishop

Mar 15, 2012, 12:53:58 PM
to diy...@googlegroups.com, Bryan Bishop
On Thu, Mar 15, 2012 at 7:53 AM, Jeswin <phill...@gmail.com> wrote:
> I got the basics of perl down and will have to review since its been a while.
>
> I woke up today with some psudeocode in my head:

from now on this is how i am showing people about python:
http://diyhpl.us/~bryan/irc/meetlog/2010-01-01_to_2010-03-01-lines2.png

or javascript:
http://learnjs.info/#mode=visualize

dunno if there's one of these for perl... that would be nice.

- Bryan
http://heybryan.org/
1 512 203 0507

Bryan Bishop

Mar 15, 2012, 12:54:33 PM
to diy...@googlegroups.com, Bryan Bishop
On Thu, Mar 15, 2012 at 11:53 AM, Bryan Bishop <kan...@gmail.com> wrote:
> from now on this is how i am showing people about python:
> http://diyhpl.us/~bryan/irc/meetlog/2010-01-01_to_2010-03-01-lines2.png

obv. the wrong link.. sorry, i meant:
http://people.csail.mit.edu/pgbovine/python/tutor.html#mode=edit

Pat

Mar 22, 2012, 5:15:16 AM
to diy...@googlegroups.com
@philly Are you using BLASTN or Primer-BLAST?

Jeswin

Mar 22, 2012, 9:24:06 AM
to diy...@googlegroups.com
On Thu, Mar 22, 2012 at 5:15 AM, Pat <ele...@gmail.com> wrote:
> @philly Are you using BLASTN or Primer-BLAST?
>
I guess BLASTN. I hadn't heard of Primer-BLAST until now.

The HBV genome is highly polymorphic, so a primer that will
bind to most of the DNA fragments is a bit hard to design. I'm still new
to this, so maybe it is easier for others.

Pat

Mar 23, 2012, 8:03:54 AM
to diy...@googlegroups.com
Check it out when you get a chance: http://www.ncbi.nlm.nih.gov/guide/howto/design-pcr-primers/

Jeswin

Mar 23, 2012, 11:15:24 AM
to diy...@googlegroups.com
On Fri, Mar 23, 2012 at 8:03 AM, Pat <ele...@gmail.com> wrote:
> Check it out when you get a chance: http://www.ncbi.nlm.nih.gov/guide/howto/design-pcr-primers/
>
Looking at the site, it seems that it checks the primer that you have
already built against the database. We need the opposite. We design
the primer based on the most common mutations for the WT sequence site
of interest which is known to us.

The primer should bind to the most commonly seen mutations.

Cathal Garvey

Mar 23, 2012, 11:17:16 AM
to diy...@googlegroups.com
Handy for ruling out nonspecific binding then but not for designing primers from scratch?


Nathan McCorkle

Mar 23, 2012, 12:54:03 PM
to diy...@googlegroups.com
On Fri, Mar 23, 2012 at 11:15 AM, Jeswin <phill...@gmail.com> wrote:
> On Fri, Mar 23, 2012 at 8:03 AM, Pat <ele...@gmail.com> wrote:
>> Check it out when you get a chance: http://www.ncbi.nlm.nih.gov/guide/howto/design-pcr-primers/
>>
> Looking at the site, it seems that it checks the primer that you have
> already built against the database. We need the opposite. We design
> the primer based on the most common mutations for the WT sequence site

If that's the case, don't you want to do an alignment of all the
possible sequences for your gene? The few highest-scoring alignments
would seem to be the primer you want. (Are you doing PCR, or is this
for hybridization/blot assays?)

Jeswin

Mar 24, 2012, 12:20:10 PM
to diy...@googlegroups.com
On Fri, Mar 23, 2012 at 12:54 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> if that's the case, don't you want to do an alignment of all the
> possible sequences for your gene, the few highest scoring alignments
> would seem to be the primer you want. (are you doing PCR, or is this
> for hybridization/blot assays?)
>
We do PCR, but we know the mutation site. To build the primers, we go
through a sequence (which can be a few hundred bp long) around our PCR
site of interest. We expect to find a place where the sequences are
highly identical to our query. Sometimes, we find a base mutation that
is common in all our hits so we can "wobble" that base. We like to
make the primer in a "clean" location, i.e., one where there is not a
lot of variation in the mutations along the possible primer site.
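
One way to make the search for a "clean" location concrete is to score every primer-length window by how often the hits differ from the query inside it. The sketch below is only an illustration under assumptions: rows are dotted alignment strings of equal length with no gaps, '.' means identical to the query, and the scoring rule (sum of per-column mismatch rates) is my own simplification, not the lab's actual criterion. The function names and sample rows are made up.

```python
def mismatch_rate(rows):
    """Fraction of hits differing from the query at each column."""
    width = len(rows[0])
    return [sum(row[i] != "." for row in rows) / len(rows)
            for i in range(width)]


def cleanest_window(rows, primer_len):
    """Return (start, score) for the window with the lowest summed rate.

    start is 0-based; ties go to the leftmost window.
    """
    rates = mismatch_rate(rows)
    scores = [(sum(rates[s:s + primer_len]), s)
              for s in range(len(rates) - primer_len + 1)]
    score, start = min(scores)
    return start, score


# Invented rows for a 10-base query; '.' means "same as the query".
rows = ["..T.......", "..T....A..", "..........", "..T......."]
start, score = cleanest_window(rows, primer_len=4)
print(f"cleanest 4-base window starts at column {start + 1} (score {score})")
```

A real primer search would also have to weigh melting temperature, GC content and the position of the mutation of interest, none of which this sketch considers.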

I attached a file I worked on. We got 6000 hits for this 60 bp BLAST
query. We looked through a total of 500 bp to find a good primer site.
You can see in the file that the 3rd bp has thymine as a common base,
so we can wobble that. Eventually, we determine the best site to build
a primer by analyzing all the hits together.
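
The "3rd bp has thymine as a common base" observation is a per-column base count, which can be sketched as follows. Assumptions: the rows are ungapped dotted strings the same length as the query, and a substitution is worth reporting once it reaches 10% of hits (an arbitrary threshold). The function names and sample rows are hypothetical; the query is just the first 10 bases of the example at the top of the thread.

```python
from collections import Counter


def column_counts(query, rows):
    """Per-column Counter of bases after replacing '.' with the query base."""
    columns = [Counter() for _ in query]
    for row in rows:
        for i, (q, base) in enumerate(zip(query, row)):
            columns[i][q if base == "." else base.upper()] += 1
    return columns


def variable_positions(query, rows, min_fraction=0.1):
    """List (1-based position, query base, variant base, fraction of hits)."""
    report = []
    for pos, counts in enumerate(column_counts(query, rows), start=1):
        total = sum(counts.values())
        for base, n in counts.most_common():
            if base != query[pos - 1] and n / total >= min_fraction:
                report.append((pos, query[pos - 1], base, n / total))
    return report


query = "AACGGCCAGG"
# Invented hit rows; '.' means "same as the query".
rows = ["..T.......", "..T....C..", "..........", "..T......."]
for pos, ref, alt, frac in variable_positions(query, rows):
    print(f"position {pos}: {ref} -> {alt} in {frac:.0%} of hits")
```

Positions that show up here with a high fraction are candidates for a wobble base; positions that never show up are the conserved ones.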

It is very time-consuming, and I want to automate at least half of the
process. All this sorting and editing with MS Word and Excel is
tiring on my eyes. The boss is used to it and he
doesn't mind.

seq_1901-1960.pdf

MikeyJ

Mar 25, 2012, 7:41:32 AM
to diy...@googlegroups.com
I think you need to take a different approach to this for a couple of reasons.

1. The frequency of particular mutations in the database reflects what is deposited (common mistakes, common biases, etc.), not necessarily what is most frequent.
2. BLAST is a local rather than global aligner and is not the best tool for what is essentially a phylogenetics task, as Dakota mentions.
3. Having your sequences in Word and Excel makes them very difficult to handle and can result in creating and propagating errors (for example http://www.biomedcentral.com/1471-2105/5/80).

My approach would be to get as many of the sequences as possible into an ARB database (http://www.arb-silva.de/, free), align them using a global aligner (Clustal, MUSCLE, etc.) and then manually check the quality of the alignment. Use ARB's PT server to generate a quickly searchable database of sequences, and then get it to select potential primers for you that discriminate between subsets of the sequences (which can be grouped via a phylogenetic tree) or mutations. The nice thing with this approach is that you have a database that you can add to when new sequences become available, and you can check whether you need to update your primers or not.

How common the particular mutations are, and therefore how useful each primer set is to you, can be overlaid afterwards. If you have the frequency information as part of the primer search, then database errors are propagated. Better to define what different types there are first and then to look at frequency second (or as an experiment).

Where you may come across difficulties with the above is that I'm more used to doing this with single genes or operons rather than whole genomes, which are likely to be harder to align because of their higher variability (though viral genomes aren't that big). Is there a typing scheme for hepB? Can you align representatives within the subtypes first? How many whole genome sequences are available, and has anyone done any comparative genomics? The above approach will definitely work if you're only interested in designing primers for one of the genes.

Cheers

Mike


Dakota Hamill

Mar 25, 2012, 12:25:45 PM
to diy...@googlegroups.com
"A third reason for error applies primarily to unverified databases such as GenBank, which accept any linked name and sequence that is sent to them. In the 1995 investigation of GenBank sequences, a comparison of sequences for the presumed same species showed that almost 20% had more than 2% variability, indicating that many genetically different strains were being deposited under the same species name (18). On some occasions, the name associated with a given sequence may not be correct due to poor phenotypic descriptions or faulty testing."

Here is the article I mentioned way up above, if anyone is still interested:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC523561/?tool=pubmed

It should be free and was an interesting read. Although maybe not fully related, it did delve into assigning phylogeny of bacteria using 16S sequences and the problems associated with it.

And this was the commercial database they were talking about: http://www.appliedbiosystems.com/absite/us/en/home/support/software/dna-sequencing/microseq-id-systems.html I can't tell if it's free or not, because it keeps giving this computer I am using an unzip error or something, and since it looks like it is only a 16S and fungal ID'er, it's not of much use. Oh well, it's something to be aware of and might be useful in the future. Good luck on your quest.

-Dakota

