Hello Pygr-dev,
It seems that others have previously run into warnings like:
*** WARNING: Unknown sequence hg19.chr6_qbl_hap6 ignored...
*** WARNING: Unknown sequence panTro2.chr6 ignored...
*** WARNING: Unknown sequence ponAbe2.chr6 ignored...
when building an NLMSA using MAF files. I'm running into thousands of
these when using multiz46way and six corresponding genomes, all
downloaded from UCSC. With grep I can verify that, for instance,
ponAbe2.chr6 references exist in chr6.maf and that my ponAbe2 fasta
file really contains a >chr6 header.
How can I determine whether these warnings originate in my files or in
my pygr code? Any suggestions?
Thank you,
Chris
Chris Fuller
ch...@genome.ucsf.edu
The code I'm using (in Eclipse) is:
import os, glob
from pygr import cnestedlist,seqdb
# Create list of full paths to all MAF files involved
maf_path_string = '/home/chris/Storage/Data/Public/Human/hg19_MAF'
maf_files_list = glob.glob(maf_path_string + '/*.maf')
# Create list of full paths to each Genome in single FASTA format
genomes = {}
seqlist = ['hg19', 'panTro2', 'ponAbe2', 'rheMac2', 'mm9', 'rn4']
genomes_path_string = '/home/chris/Storage/Data/Public/Genomes/single_file'
seqlist_path = []
for i in range(len(seqlist)):
    seqlist_path.append(genomes_path_string + '/' + seqlist[i])
for orgstr in seqlist_path:
    genomes[orgstr] = seqdb.SequenceFileDB(orgstr)
genomeUnion = seqdb.PrefixUnionDict(genomes)
# Now build it:
NLMSA_path = '/home/chris/Storage/Data/Public/Human/hg19_MAF/NLMSA'
msa = cnestedlist.NLMSA(pathstem=NLMSA_path, mode='w',
                        seqDict=genomeUnion, mafFiles=maf_files_list,
                        bidirectional=False)
msa.build(saveSeqDict=True)
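
One way to narrow down whether the files or the code are at fault is to compare names outside of pygr. The following is only a rough sketch, not anything pygr provides: the helper names (maf_sequence_names, fasta_ids, unmatched_names) are made up for illustration, and the sample lines are invented rather than taken from the real files. It assumes that each MAF "s" line carries a source name of the form assembly.sequence in its second field, and that the union dictionary resolves names by joining each of its keys to a sequence ID with a '.'.

```python
def maf_sequence_names(lines):
    """Collect source names (e.g. 'ponAbe2.chr6') from MAF 's' lines."""
    names = set()
    for line in lines:
        fields = line.split()
        # MAF sequence lines look like: s src start size strand srcSize text
        if len(fields) >= 2 and fields[0] == 's':
            names.add(fields[1])
    return names


def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from FASTA header lines."""
    ids = set()
    for line in lines:
        if line.startswith('>'):
            header = line[1:].split()
            if header:
                ids.add(header[0])
    return ids


def unmatched_names(maf_names, ids_by_prefix, separator='.'):
    """Return MAF names that no prefix + separator + ID combination produces.

    ids_by_prefix maps each dictionary key (the assumed prefix) to the set
    of sequence IDs found in the corresponding FASTA file.
    """
    known = set()
    for prefix, ids in ids_by_prefix.items():
        for seq_id in ids:
            known.add(prefix + separator + seq_id)
    return sorted(maf_names - known)


# Invented sample data standing in for chr6.maf and a genome FASTA file.
sample_maf = [
    '##maf version=1',
    'a score=100.0',
    's hg19.chr6 10 4 + 1000 ACGT',
    's ponAbe2.chr6 20 4 + 1000 ACGT',
    '',
]
sample_fasta = ['>chr6 example description', 'ACGTACGT']

names = maf_sequence_names(sample_maf)
ids = fasta_ids(sample_fasta)

# Keys given as bare assembly names versus keys given as full paths.
by_name = unmatched_names(names, {'hg19': ids, 'ponAbe2': ids})
by_path = unmatched_names(names, {'/some/dir/hg19': ids,
                                  '/some/dir/ponAbe2': ids})
```

With the real data, the lists would come from open(path) on each MAF and FASTA file, and ids_by_prefix would use exactly the keys of the genomes dictionary passed to PrefixUnionDict. Any name still reported as unmatched would then point either at a sequence missing from the FASTA files or at keys that do not line up with the MAF prefixes.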