Fwd: Blast in pygr

Julien Bellis

unread,

Aug 24, 2010, 5:08:31 AM8/24/10

to pygr...@googlegroups.com

Hi,

I need to map oligos from Affymetrix arrays (25 mers) to oligos from Operon (70 mers). For that I use the blast function under pygr in the following way :

from pygr import seqdb

from pygr import blast

operon = seqdb.SequenceFileDB('OPERONprobes.fasta')

blastmap = blast.BlastMapping(operon)

affy = seqdb.SequenceFileDB('ATH1probes.fasta')

numberprobe = len(affy)

for i in numberprobe:

nameprobe = affy.keys()[i]

probemap = affy[nameprobe]

try:

fprobe = blastmap[probemap]
edges.extend([fprobe.edges()])

del fprobe

except NameError:

edges = fprobe.edges()

except:

pass

'OPERONprobes.fasta' contains the operon oligos, it's a fasta file with 29000 sequences, and the 'ATH1probes.fasta' contains the affymetrix oligos in fasta format, it contains 250 000 oligos sequences.
It works this way, however it's rather slow, the complete mapping will take around 40 hours.
Lookink in the tutorial on the blast module I found that it is possible to pass a 'queryDB' argument in the form of a dictionary containing multiple sequences to be used as queries

Useful methods:
BlastMapping.__call__(seq, al=None, blastpath='blastall', blastprog=None, expmax=0.001, maxseq=None, opts=", verbose=None, queryDB=None)

I tried several things but I couldn't make it work. I have several questions on how to use this function :
How to make the dictionary ? In my case should this dictionnary be comprised of the sequence I want to use to query or to map to, meaning should this dictionnary contains the sequence of the Operon arrays or of the Affymetrix arrays ?
In the kwargs argument, the seq should be the operon I guess ?
After having done the BlastMapping using the blastmap = blast.BlastMapping(operon,**kwargs), how to query this blast ?
Do I have to query it sequence by sequence or can I query it using the dictionnary I used as an argument in the kwargs ?
I tried this :

seq_db = dict(seq1=affy[affy.keys()[1]], seq2=affy[affy.keys()[2]], ...)

kwargs = {"seq":'operon', "al":None, "blastpath":'blastall', "blastprog":'blastn', "expmax":0.001, "maxseq":None, "verbose":None, "queryDB":'seq_db'}
blastmap = blast.BlastMapping(operon,**kwargs)

fprobe = blastmap[seq_db]

Using the function this way I got this error message :

Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/pygr/blast.py", line 191, in __getitem__
    al = self.__call__(k) # run BLAST & get NLMSA storing results
File "/usr/local/lib/python2.6/dist-packages/pygr/blast.py", line 348, in __call__
    blastprog = self.blast_program(seq, blastprog)
File "/usr/local/lib/python2.6/dist-packages/pygr/blast.py", line 301, in blast_program
    blastprog = blast_program(seq.seqtype(), self.seqDB._seqtype)
AttributeError: 'dict' object has no attribute 'seqtype'

any help will be very useful,

regards,
Julien Bellis

Christopher Lee

unread,

Aug 24, 2010, 2:05:54 PM8/24/10

to pygr...@googlegroups.com

On Aug 24, 2010, at 2:08 AM, Julien Bellis wrote:

> 'OPERONprobes.fasta' contains the operon oligos, it's a fasta file with 29000 sequences, and the 'ATH1probes.fasta' contains the affymetrix oligos in fasta format, it contains 250 000 oligos sequences.
> It works this way, however it's rather slow, the complete mapping will take around 40 hours.
> Lookink in the tutorial on the blast module I found that it is possible to pass a 'queryDB' argument in the form of a dictionary containing multiple sequences to be used as queries
>
> Useful methods:
> BlastMapping.__call__(seq, al=None, blastpath='blastall', blastprog=None, expmax=0.001, maxseq=None, opts=", verbose=None, queryDB=None)
>
> I tried several things but I couldn't make it work. I have several questions on how to use this function :
> How to make the dictionary ? In my case should this dictionnary be comprised of the sequence I want to use to query or to map to, meaning should this dictionnary contains the sequence of the Operon arrays or of the Affymetrix arrays ?
> In the kwargs argument, the seq should be the operon I guess ?
> After having done the BlastMapping using the blastmap = blast.BlastMapping(operon,**kwargs), how to query this blast ?
> Do I have to query it sequence by sequence or can I query it using the dictionnary I used as an argument in the kwargs ?

Here's an example that I've abstracted from one of our BLAST tests in tests/blast_test.py (see get_multiblast_results() or test_multiblast_long() for another example):

protDB = seqdb.SequenceFileDB(sp_hbb1)
blastmap = blast.BlastMapping(protDB, verbose=False)
al = cnestedlist.NLMSA('blasthits', 'memory', pairwiseMode=True,
bidirectional=False)
blastmap(al=al, queryDB=protDB) # all vs all
al.build() # construct the alignment indexes
for seq in protDB.values(): # look at hits for one sequence at a time...
results = al[seq]

A few notes:
- the sequence "dictionary" should be a sequence database object. Just pass your "affy" SequenceFileDB object.
- blastmap() stores all the results in a single NLMSA alignment. In the example above the NLMSA is opened in "memory" mode (i.e. stored in RAM); if you want the results stored on disk instead use mode 'w'.
- Note the NLMSA must be created in pairwiseMode=True to reflect the nature of blast mappings (i.e. blast is a pairwise alignment, not a true multiple sequence alignment).
- you access the results from the NLMSA in the usual way as illustrated above...

Hope that helps!

-- Chris Lee

Julien Bellis

unread,

Aug 25, 2010, 4:06:02 AM8/25/10

to pygr...@googlegroups.com

Hi Chris,

thanks for your answer. I tried with your settings but it doesn't help
to make the query faster, so I guess I just have to wait !