Problems Building NLMSA from MAF

Chris Fuller

Sep 16, 2010, 5:29:35 PM
to pygr-dev
Hello Pygr-dev,

It seems that others have previously run into warnings like:

*** WARNING: Unknown sequence hg19.chr6_qbl_hap6 ignored...
*** WARNING: Unknown sequence panTro2.chr6 ignored...
*** WARNING: Unknown sequence ponAbe2.chr6 ignored...

when building an NLMSA using MAF files. I'm running into thousands of
these when using multiz46way and six corresponding genomes, all
downloaded from UCSC. With grep I can verify that, for instance,
ponAbe2.chr6 references exist in chr6.maf and that my ponAbe2 fasta
file really contains a >chr6 header.

How can I determine if these errors originate in my files or my pygr
code? Any suggestions?

Thank you,

Chris

Chris Fuller
ch...@genome.ucsf.edu

The code I'm using (in Eclipse) is:

import os, glob
from pygr import cnestedlist,seqdb

# Create list of full paths to all MAF files involved
maf_path_string = '/home/chris/Storage/Data/Public/Human/hg19_MAF'
maf_files_list = glob.glob(maf_path_string + '/*.maf')

# Create list of full paths to each Genome in single FASTA format
genomes = {}
seqlist = ['hg19', 'panTro2', 'ponAbe2', 'rheMac2', 'mm9', 'rn4']
genomes_path_string = '/home/chris/Storage/Data/Public/Genomes/single_file'
seqlist_path = []
for i in range(len(seqlist)):
    seqlist_path.append(genomes_path_string + '/' + seqlist[i])

for orgstr in seqlist_path:
    genomes[orgstr] = seqdb.SequenceFileDB(orgstr)
genomeUnion = seqdb.PrefixUnionDict(genomes)

# Now build it:
NLMSA_path = '/home/chris/Storage/Data/Public/Human/hg19_MAF/NLMSA'
msa = cnestedlist.NLMSA(pathstem=NLMSA_path, mode='w',
                        seqDict=genomeUnion, mafFiles=maf_files_list,
                        bidirectional=False)
msa.build(saveSeqDict=True)
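
One thing that may be worth ruling out in a script like this is the set of keys handed to the prefix dictionary. As far as I understand PrefixUnionDict, it splits a name such as ponAbe2.chr6 at the first '.' and looks the left-hand part up among the dictionary keys, so keys that are full file paths would never match. The sketch below is a pure-Python model of that lookup under this assumption; it does not call pygr, and `resolves` and both example dictionaries are hypothetical names made up for illustration.

```python
def resolves(name, prefix_dict, separator='.'):
    """Model of a prefix-based lookup: 'ponAbe2.chr6' -> ('ponAbe2', 'chr6').

    prefix_dict maps a prefix to the collection of sequence IDs in that genome.
    This is an illustrative stand-in, not pygr's actual implementation.
    """
    try:
        prefix, seq_id = name.split(separator, 1)
    except ValueError:
        # No separator at all, so there is no prefix to look up.
        return False
    return prefix in prefix_dict and seq_id in prefix_dict[prefix]


# Keys are genome names, as they appear in the MAF file.
name_keyed = {'ponAbe2': {'chr6'}, 'hg19': {'chr6'}}

# Keys are file paths (hypothetical directory), which never equal 'ponAbe2'.
path_keyed = {'/data/genomes/ponAbe2': {'chr6'}, '/data/genomes/hg19': {'chr6'}}

name_keyed_ok = resolves('ponAbe2.chr6', name_keyed)
path_keyed_ok = resolves('ponAbe2.chr6', path_keyed)
```

If this model matches what pygr does, keying the dictionary by genome name (for example 'ponAbe2') rather than by path would be the thing to try first.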


Namshin Kim

Sep 16, 2010, 9:28:13 PM
to pygr...@googlegroups.com
Hi Chris,

I think you should check whether the given chromosome is actually in the SequenceFileDB.

That is, look up the chromosome directly:

>>> hg19['chr6_qbl_hap6']
>>> ponAbe2['chr6']
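
The same check can be run across a whole MAF file instead of one chromosome at a time. The following is a minimal sketch in plain Python, with no pygr calls: it collects the source names from the MAF 's' lines, splits each at the first '.', and reports names whose chromosome is not among the FASTA header IDs for that genome. All function names here are hypothetical helpers, and it assumes the genome prefix in the MAF (for example ponAbe2) is the key you use for that genome.

```python
def maf_sequence_names(lines):
    """Return the set of source names found on MAF 's' (sequence) lines."""
    names = set()
    for line in lines:
        fields = line.split()
        # An 's' line has 7 fields: s src start size strand srcSize text
        if len(fields) >= 7 and fields[0] == 's':
            names.add(fields[1])
    return names


def fasta_ids(lines):
    """Return the set of sequence IDs (first word after '>') in FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith('>'):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def missing_sequences(maf_names, ids_by_genome):
    """List MAF names whose genome prefix or chromosome ID is not available."""
    missing = []
    for name in sorted(maf_names):
        prefix, _, seq_id = name.partition('.')
        if seq_id not in ids_by_genome.get(prefix, set()):
            missing.append(name)
    return missing
```

To use it on real files, pass open file objects (they iterate line by line), for example `maf_sequence_names(open('chr6.maf'))`, and build `ids_by_genome` with one `fasta_ids(...)` call per genome file.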

I don't see any problems with your NLMSA building script.

If you want to build the NLMSA from the UCSC genome browser's MULTIZ alignments rather than custom ones, I suggest another solution: you can download the genomes and pre-built NLMSA files from http://biodb.bioinformatics.ucla.edu/PYGRDATA/ and http://biodb.bioinformatics.ucla.edu/GENOMES/

Or, you can pass the download=True option to automatically download and build the SequenceFileDB and NLMSA from biodb2.bioinformatics.ucla.edu:

# Save your resources in '.' (the current directory) and connect to the
# biodb2 pygr resources via http://biodb2.bioinformatics.ucla.edu:5000
WORLDBASEPATH = '.,http://biodb2.bioinformatics.ucla.edu:5000'

from pygr import worldbase
worldbase('Bio.MSA.UCSC.hg19_multiz46way', download=True)
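
A short sketch of one way to apply that setting from inside a script, assuming WORLDBASEPATH is meant as the environment variable that pygr consults rather than an ordinary Python variable. Setting it before the import is a conservative choice, since I am not certain at which point pygr reads it. `configure_worldbase_path` and `load_alignment` are hypothetical helper names; only the `worldbase(...)` call itself comes from the message above.

```python
import os


def configure_worldbase_path(local_dir='.',
                             server='http://biodb2.bioinformatics.ucla.edu:5000'):
    """Set WORLDBASEPATH to 'local_dir,server' and return the value used."""
    path = ','.join([local_dir, server])
    os.environ['WORLDBASEPATH'] = path
    return path


def load_alignment(resource_id='Bio.MSA.UCSC.hg19_multiz46way'):
    """Request the resource; the import happens after the path is configured."""
    from pygr import worldbase  # requires pygr to be installed
    return worldbase(resource_id, download=True)


worldbase_path = configure_worldbase_path()
```

Calling `load_alignment()` afterwards triggers the actual download, so it is left out of the module-level code here.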

--
Namshin Kim

Chris Fuller

Oct 4, 2010, 8:25:37 PM
to pygr-dev
Hi Namshin,

Thank you for the prompt reply on this. I eventually gave up trying
to understand the error and, as you suggested, downloaded the
pre-built NLMSA file. Regardless of how I do this (automatic download or
using NLMSABuilder on a manually downloaded file), it always throws
this error:

pygr.metabase.WorldbaseNotFoundError: 'unable to find Bio.MSA.UCSC.hg19_multiz46way in WORLDBASEPATH'

Using worldbase.dir(), Bio.MSA.UCSC.hg19_multiz46way shows up along
with local copies of the genomes that I've downloaded. I can access
and use the genomes just fine, but the MSA throws this error. Yet,
all of these files reside in the same folder!

Any suggestions?

Chris


