Pygr to print MSA from UCSC

7 views
Skip to first unread message

jbiesinger

unread,
Jul 28, 2010, 8:55:31 PM7/28/10
to pygr-dev
Hi!

To learn pygr MSA's a bit better, I thought it would be interesting to
reproduce the output of the mafFrags program from the Kent source
tree. With Mouse as the reference genome, I'd like something along
the lines of:

mm9 TTTCTAAAAA---ACTGAGAAAATG
hg18 TTGGAATTTTTTTTTTCATAAAAATG
canFam2 C--CCGTGCAGAAATCGAGAAATTAAAGATG

Is this even possible with the multiz alignment? Is there another
alignment format that has a global coordinate system to properly show
contiguous sequence in non-reference genomes and/or show insertions/
deletions in the reference genome?

My first attempt was to forget about insertions in all the
'target' (aligning) species. Thus,
<snip>
genome = worldbase('Bio.Seq.Genome.MOUSE.mm9')
msa = worldbase('Bio.MSA.UCSC.mm9_multiz30way')
seq2name = ~(msa.seqDict)
allowedDBs = ['hg18', 'canFam2', 'bosTau3']
allowedIndexes = dict(zip(allowedDBs, range(1, len(allowedDBs)
+1)))
for regIndex, region in enumerate(allRegions):
# init alignment as gaps (----) for all species
alignmentStrs = scipy.array([['-' for i in
range(enhancerLength)] for j in range(len(allowedDBs) + 1)])
alignmentStrs[0,:] = list(str(region)) # fill in mm9 sequence
for mm9Part, otherSeq, edge in msa[region].edges():
# filter edges from organisms we don't care about
orgName = seq2name[otherSeq].split('.')[0]
if orgName in allowedIndexes:
orgIndex = allowedIndexes[orgName]
relativeStart = mm9Part.start - region.start
# overwrite the aligning sequence
alignmentStrs[orgIndex,relativeStart:relativeStart
+len(otherSeq)] = list(str(otherSeq))
outstr = '\n'.join(''.join(alignmentStrs[i]) + (allowedDBs +
['mm9'])[i].rjust(10) for i in range(len(allowedDBs) + 1))
print outstr

</snip>
As I mentioned, the problem with this approach is that it ignores
insertions in any non-mouse species, making different genomic regions
appear contiguous.

I understand that there are two other ways of doing this: letter by
letter using msa[region].letters, which provides a letter-by-letter
iterator, with a graph from the reference genome to each aligning
letter in the other genomes. Something like:
<snip>
for letterNode in msa[region].letters:
print letterNode.edges[0][0] # mouse letter
for mouseLetter, otherLetter, edge in letterNode.edges: # not a
function for some reason...
print otherLetter # print all letters aligning to this one
</snip>
I can't get the final way to work. Using msa[region].regions(), I can
get the same information, but in large chunks rather than letter by
letter.

Thanks in advance for any direction!

Christopher Lee

unread,
Jul 28, 2010, 10:46:58 PM7/28/10
to pygr...@googlegroups.com
Hi,
I think msaslice.regions() is the recommended way to do this. It gives you NLMSASlice(s) indexed in terms of the multiple alignment coordinate system (including all inserts) rather than in terms of a specific sequence. So this will show you how all sequences align to each other, not just to a reference sequence. Note that regions() returns a list, since (due to the length limits of 32 bit integers) NLMSA may have to split the whole alignment into more than one coordinate system, so in principle it is possible for the regions() result to span multiple coordinate systems. So you use it like this

for coordslice in msaslice.regions(): # most likely just one coord system...

Once you have that you can do something like

l = coordslice.edges()
coordmin, coordmax = min([t[0].start for t in l]), max([t[0].stop for t in l])
for coords, otherSeq, e in l:


orgName = seq2name[otherSeq].split('.')[0]
if orgName in allowedIndexes:
orgIndex = allowedIndexes[orgName]

alignmentStrs[orgIndex,coords.start - coordmin\
:coords.stop - coordmin] = list(str(otherSeq))

A couple notes on regions():
- regions() does not work over XMLRPC. You have to be accessing the NLMSA locally.
- as far as I know, not very many people have used this, so you may run into problems. If you do, we'll fix them.

Yours,

Chris

jbiesinger

unread,
Jul 28, 2010, 11:36:37 PM7/28/10
to pygr-dev
Ah, excellent. I'm glad there *is* a way to do this, even though
these alignments are reference-centric.

I saw NLMSA regions before, but unfortunately I haven't been able to
get it to work yet. I had been using a mixture of local and XMLRPC
db's for hg18_multiz44way and regions() would always return an empty
list (some error message would be more helpful).

However, for mm9_multiz30way, perhaps there is some build issue.

genome = worldbase('Bio.Seq.Genome.MOUSE.mm9', download=True)
msa = worldbase('Bio.MSA.UCSC.mm9_multiz30way', download=True)
curRegion = genome['chr1'][3241792:3242792]
msa[curRegion].regions()

---------------------------------------------------------------------------
ValueError Traceback (most recent call
last)

/home/wbiesing/<ipython console> in <module>()

/usr/local/lib/python2.6/dist-packages/pygr/cnestedlist.so in
pygr.cnestedlist.NLMSASlice.regions()

ValueError: no LPO in nlmsaSlice.seqBounds? Debug!


I haven't tried this with any other MSA's (I can only fit so many
genomes on my paltry disk). Perhaps I am on old code?

pygr.__version__
'0.8.1'

Could also be a mixture of DB types that's causing the error. As you
saw in my code, I download the mouse genome first, then the MSA. As a
result, the Mouse genome is a SequenceDB while the others are all
BlastDB's.

{'anoCar1': <BlastDB '/home/shared/pygrdata/genomes/anoCar1'>,
...
'loxAfr1': <BlastDB '/home/shared/pygrdata/genomes/loxAfr1'>,
'mm9': <SequenceFileDB '/home/shared/pygrdata/genomes/mm9'>,
'monDom4': <BlastDB '/home/shared/pygrdata/genomes/monDom4'>,
...
'tupBel1': <BlastDB '/home/shared/pygrdata/genomes/tupBel1'>,
'xenTro2': <BlastDB '/home/shared/pygrdata/genomes/xenTro2'>}

Anything else I can provide to help track down the issue?

jbiesinger

unread,
Jul 28, 2010, 11:44:35 PM7/28/10
to pygr-dev
> Could also be a mixture of DB types that's causing the error.  As you
> saw in my code, I download the mouse genome first, then the MSA.  As a
> result, the Mouse genome is a SequenceDB while the others are all
> BlastDB's.
>
> {'anoCar1': <BlastDB '/home/shared/pygrdata/genomes/anoCar1'>,
> ...
>  'loxAfr1': <BlastDB '/home/shared/pygrdata/genomes/loxAfr1'>,
>  'mm9': <SequenceFileDB '/home/shared/pygrdata/genomes/mm9'>,
>  'monDom4': <BlastDB '/home/shared/pygrdata/genomes/monDom4'>,
> ...
>  'tupBel1': <BlastDB '/home/shared/pygrdata/genomes/tupBel1'>,
>  'xenTro2': <BlastDB '/home/shared/pygrdata/genomes/xenTro2'>}
>
After deleting mm9 from worldbase and reissuing:
msa = worldbase('Bio.MSA.UCSC.mm9_multiz30way', download=True)

So that all the resources are now BlastDB's,
'mm9': <BlastDB '/home/shared/pygrdata/genomes/mm9'>,
I'm still getting the same issue.

Christopher Lee

unread,
Jul 28, 2010, 11:46:14 PM7/28/10
to pygr...@googlegroups.com
Hi,
I don't think SequenceDB vs. BlastDB matters here, so I don't think that's the problem. I will try to reproduce your error message. Unfortunately the one machine that allows ssh logins from outside appears to be down right now, and I can't test this over XMLRPC. So I either need to download a really small NLMSA to my macbook pro to try to reproduce this, or physically go reboot that server, which I won't be able to do till Friday.

-- Chris

jbiesinger

unread,
Jul 29, 2010, 12:53:04 AM7/29/10
to pygr-dev
Unfortunately, it looks like the issue is present on some smaller
MSA's...

msa3 = worldbase('Bio.MSA.UCSC.xenTro1_multiz5way', download=True)
genome = worldbase('Bio.Seq.Genome.XENTR.xenTro1')
region = genome['scaffold_1'][760860:765260]
msa3[region].regions()

jbiesinger

unread,
Aug 9, 2010, 8:46:30 PM8/9/10
to pygr-dev
Any progress on this? Were you able to restart the server?

Christopher Lee

unread,
Aug 11, 2010, 6:52:07 PM8/11/10
to pygr...@googlegroups.com
Hi Jake,
I have posted a fix for this to the "regions" branch of my github repository. Please try it and let me know if this solves your problem. If you're not already accessing my code from github, get it like this:

git clone git://github.com/cjlee112/pygr.git

Then to checkout my "regions" branch:

cd pygr
git checkout remotes/origin/regions

-- Chris

Reply all
Reply to author
Forward
0 new messages