expanded NLMSA interfaces?


Christopher Lee

Apr 23, 2008, 4:53:46 PM
to pygr...@googlegroups.com
Hi,
I'm now going to send messages to the Google group address instead of
individual emails...

Titus and Namshin have raised some excellent questions about expanding
the NLMSA interfaces for new ways of searching. These seem like great
ideas, and should be fairly easy to provide. The real question is
what we want. I'll try to summarize some possibilities, and give some
background on what's easy vs. hard.

Possible new interfaces:
- joins: if you want to join two entire datasets, there's a huge
performance gain for doing it inside pyrex / C as a single method call
from Python (instead of using Python to iterate over one dataset as
individual queries to the other dataset). My feeling is that this
should be a core capability of NLMSA.

- iterator: "walk" the stored intervals / annotations in order on the
chromosome. Again, this seems like it should be a core capability.

- closely related features like "find nearest neighbor" and "distance to
nearest neighbor" all flow from the same capability.

- different kinds of queries: Right now there is only one kind of
query ("find overlaps"). Other possibilities: "find exact match",
"find contained intervals", "find intervals containing", etc. the 13
Allen interval logic relations. "find N neighbors".

- we could integrate filtering into querying: Right now, filtering is
a NLMSASlice method, which means you do it AFTER querying. If the
filter removes a huge volume of results, it clearly would be better to
do that during querying. E.g. I only want to get results that map to
a given target genome or region, etc.

- dynamic (read-write) NLMSA: lets you add new alignments /
annotations at any time. This could either be done using the existing
storage (a new rebuild() method would minimize the time required by
limiting the build just to the parts that changed), or creating a new
storage that is truly tree-based.

- reduced memory requirement for building: I'm not sure if Namshin or
Titus feel this is an important issue. Currently, NLMSA builds each
index in memory. That means it has to split the data into a number of
indexes to guarantee that the build() will not run out of memory. (no
phase other than building requires loading a whole index into
memory). The main consequence is that the default size value will not
work if the machine doesn't have enough memory. In that case the user
will have to re-run the build with an explicit size parameter
appropriate for their memory size. This whole issue could be made to
go away by making build() use a number of temp files. This would be a
bit slower (because it has to do more disk I/O) but could be an
automatic "fallback" method if the regular build runs out of memory.
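The temp-file fallback could work like a classic external merge sort. Here is a pure-Python sketch of the idea (the real build lives in Pyrex/C, and `write_sorted_runs` / `merge_runs` are invented names, not pygr APIs):

```python
import heapq
import pickle
import tempfile

def write_sorted_runs(intervals, chunk_size):
    """Split the interval list into sorted runs, one temp file each,
    so that only chunk_size intervals ever sit in memory at once."""
    runs = []
    for i in range(0, len(intervals), chunk_size):
        f = tempfile.TemporaryFile()
        pickle.dump(sorted(intervals[i:i + chunk_size]), f)
        f.seek(0)
        runs.append(f)
    return runs

def merge_runs(runs):
    """Lazily merge the sorted runs into one fully sorted stream."""
    return heapq.merge(*(iter(pickle.load(f)) for f in runs))
```

Merging the runs yields the whole interval list in sorted order while holding only one chunk per run in memory, at the cost of the extra disk I/O mentioned above.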

Background: what's easy vs. hard
NLMSA is inherently ordered, so it's trivial to walk all or part of
the results in order (sorted by interval start). There already exists
an internal iterator class (IntervalFileDBIterator) that gives a nice
interface to this: you give a query (start,stop) interval and get an
iterator on that region. You could draw the analogy that our current
NLMSASlice is like keys() (i.e. it loads the results into memory),
whereas this iterator is like iter() (i.e. it walks the results
without loading them as a group into memory, which makes it suitable
for walking an entire chromosome or alignment).
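That keys()-vs-iter() distinction can be illustrated with a toy generator (a linear scan stands in for the real nested-list index, and `iter_region` is an invented name):

```python
def iter_region(intervals, qstart, qstop):
    """Lazily yield intervals overlapping [qstart, qstop), in start
    order. `intervals` must be sorted by start coordinate."""
    for start, stop in intervals:
        if start >= qstop:      # sorted input: nothing further can match
            break
        if stop > qstart:       # true overlap with the query
            yield (start, stop)
```

Calling list(iter_region(...)) behaves like keys()/NLMSASlice (everything loaded at once), whereas looping over the generator directly walks the results one at a time, which is what makes whole-chromosome scans feasible.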

- Providing an interface for iterating forward is trivial (that's what
the iterator already does). An interface for walking backward would
be pretty easy too. The same class could also do various things like
find neighbors, etc.

- reduced memory build is also pretty easy. Moreover it has no
interface issues...

- different kinds of queries are pretty simple. The main question is
deciding on interface.

- dynamic (read-write) rebuild is feasible, even with the existing
NLMSA storage format. The most complicated part is figuring out the
subset of the NLMSA that has to change. One caveat is that if a top-
level list expands much, that will force re-packing of that entire
index file (which will take time).

- joins: the basic operation is to compose two maps A->B and B->C to
obtain a map of A->C.
Implementing joins is more work than the other features above. I'll
think about exactly what's involved before trying to give a summary of
the effort required.
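For concreteness, here is a naive pure-Python sketch of that composition, with intervals as (sequence, start, stop) triples. A real implementation would sort both maps on the B coordinate and merge-scan instead of this O(n*m) double loop, and would also handle strand; this only shows the clipping arithmetic:

```python
def compose(ab, bc):
    """Compose interval maps A->B and B->C into A->C.
    Each map is a list of (src, dest) pairs of (seq, start, stop)
    triples, assumed colinear within each pair."""
    ac = []
    for (a, b1) in ab:
        for (b2, c) in bc:
            if b1[0] != b2[0]:
                continue              # different B sequence: no join
            lo = max(b1[1], b2[1])
            hi = min(b1[2], b2[2])
            if lo >= hi:
                continue              # no overlap in B coordinates
            # translate the overlapping B slice back to A and on to C
            a_ival = (a[0], a[1] + (lo - b1[1]), a[1] + (hi - b1[1]))
            c_ival = (c[0], c[1] + (lo - b2[1]), c[1] + (hi - b2[1]))
            ac.append((a_ival, c_ival))
    return ac
```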


Questions:
Titus, can you clarify precisely what benefit you're hoping to get
from altering build() behavior? Do you want to
* have a better error message (if you think the current message is
confusing)?
* have NLMSA automagically do the build for the user, instead of
raising an exception? (this could be accompanied by a warning message
explaining the situation a la "you really should have called build()
first, but we'll build it for you anyway... this may take a little
while... in the future please call build() first. To hide this
message, set verbose=False.")
* something else?

I'm not sure I get the idea of requiring the user to capture the
return value from build(), and use that object as their query
interface (instead of the NLMSA object they originally constructed).
At some level, this leaves the situation unchanged: an "unbuilt NLMSA"
raises a KeyError if you try to query, whereas a "built NLMSA" works.
So on the one hand I don't see a clear benefit, and on the other hand
this seems like a change that will confuse many users. Maybe I'm
missing something...

Yours,

Chris

Namshin Kim

Apr 28, 2008, 9:29:22 AM
to pygr...@googlegroups.com
Hi Chris,
 
I think the first easy, but very useful, interface to implement is a "find nearest neighbor" feature. Some people have already asked about this on the forum. What I am proposing is very simple: after we sort all intervals, just add two attributes holding the distance to the previous/next annotation (or interval). Then we can "walk" (iterate over) the stored intervals/annotations.
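Namshin's two-attribute idea could be prototyped in plain Python along these lines (annotations are dicts for the sketch; in pygr they would be annotation objects, and -1 marks a missing neighbor):

```python
def add_neighbor_distances(annots):
    """Attach prev_dist/next_dist to annotations sorted by start on a
    single chromosome; -1 means there is no neighbor on that side."""
    for i, a in enumerate(annots):
        a['prev_dist'] = a['start'] - annots[i - 1]['stop'] if i > 0 else -1
        a['next_dist'] = (annots[i + 1]['start'] - a['stop']
                          if i < len(annots) - 1 else -1)
    return annots
```

With these attributes in place, walking left or right is just a matter of following the stored distances.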
 
I am doing an integration project using pygr. For example, if I build a dbSNP annotation database, I want to "walk" the annotations without scanning the whole chromosome; then I can extract/scan all of dbSNP. Or it could be exons, introns, or whatever. I was thinking about adding two attributes to each annotation object (distances to the previous/next annotation) and writing some scripts to add these attributes, but it would be much more powerful if we added this to NLMSA itself.
 
Yours,
Namshin Kim

cjlee112

Apr 28, 2008, 7:50:53 PM
to pygr...@googlegroups.com
Hi Namshin,
I want to understand your proposal better. It seems to me there are
several possible interpretations of what you said...

- find nearest neighbor: this is easy to implement; the only real
question is what interface you want. Titus suggested an interface
like a standard query (i.e. you supply a query interval), which would
return the closest hit around that interval. Presumably it should
also return a flag indicating whether the hit overlaps your query or
not. Then we have to decide what happens if there are hits that
OVERLAP the query. Do we want it to return a single overlapping hit
or ALL overlapping hits? Also, do we want it to behave like our
current query (return hits clipped to just their intersection with the
query), or instead return the hits "whole" (i.e. unclipped). The
latter seems more appropriate in this case. Finally, what do we want
it to return if there is no overlapping interval? Just one closest
hit? What if there are ties (two or more hits equally close to the
query); do we return them all?

It seems to me this can be generalized to give the user control over
whether they want nearest on the left, nearest on the right, etc.
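One possible set of answers to those questions, sketched in Python: return all overlapping hits unclipped if any exist, otherwise all nearest hits including ties (`find_nearest` is an invented name, not a pygr API):

```python
def find_nearest(intervals, qstart, qstop):
    """Query [qstart, qstop) against a list of (start, stop) intervals.
    Returns (hits, overlapping): every unclipped overlapping hit if any
    exist, else every interval tied for the minimum gap to the query."""
    overlaps = [(s, e) for (s, e) in intervals if s < qstop and e > qstart]
    if overlaps:
        return overlaps, True

    def gap(iv):
        s, e = iv
        return s - qstop if s >= qstop else qstart - e

    best = min(gap(iv) for iv in intervals)
    return [iv for iv in intervals if gap(iv) == best], False
```

Restricting to left-only or right-only neighbors would just mean filtering on the sign of the gap before taking the minimum.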

2. walk(iterate): this is a very different usage than the query usage
outlined above. Instead of targeting a single query interval, this
enumerates an entire set of intervals ordered by coordinate. I can't
tell from your description whether the real reason you want this is to
do joins that combine multiple mappings, or just for the ability to
enumerate a set of mappings (SNP annotations, exons, alignments,
whatever) ordered by source chromosome coordinate.

3. Regarding the idea of added attributes for left vs. right neighbor,
I need to understand better what you want. If this is a mapping of
one genome to one other genome (or annotation type) then the left and
right neighbor are simply adjacent to you in the NLMSA file, so there
is no need to store pointers to them in each record. By contrast,
this idea might be useful when a single mapping mixes multiple
coordinate systems (genomes, types of annotations, whatever). That is
exactly what happens in a true multiple sequence alignment, which uses
an LPO coordinate system as the "hub" of the alignment. An LPO
mapping mixes all the target genomes / annotations.

The problem is deciding how you want to define "neighbor", which
really is a question about what sequences get grouped as a single unit
for the purposes of "neighboring". For example, if you group at the
level of genome, then a mapping to mm7.chr1 might point to a right
neighbor that maps to mm7.chr16 (even though that's a different
chromosome). Some people might consider this surprising. If you
treat each sequence as separate, then a mapping to panTro1.scaffold997
might not even have a right neighbor (if we're at the end of the
scaffold -- even if that is really in the middle of a panTro
chromosome...). Some people might consider this surprising. Another
issue is that in a true MSA, the neighbors will all be determined in
the LPO coordinate system -- which really means on the reference
genome (e.g. hg18). Will people find this confusing?

Thus you can see that the idea of neighbors is intimately connected to
the question of "coordinate system unification", i.e. ensuring that a
set of sequences we want to walk neighbors over is defined as a
contiguous interval in the union coordinate system. As I emphasized in
my discussion of joins, we need to do this anyway.
Right now our sequences are added to the union coordinate system in
the order that they are encountered while reading in the alignment --
which seems rather arbitrary. Grouping and ordering sequences in the
union in consistent ways will be crucial for joins, efficient target
filtering ("get me all hits in this genome") etc.

4. These complicated issues and your comments make me wonder whether
it would be better to think of this in terms of JOIN operations. A
join captures precisely the issue of sorting things into a desired
coordinate system order (possibly via a pre-computed index), to
efficiently combine them. The user would specify the join they want,
which tells the system what coordinate system to sort (and join) on.
Making this explicit may make joins more general, and less potentially
confusing than a single pointer hardwired into each interval mapping
record as outlined above...

C. Titus Brown

Apr 29, 2008, 10:24:18 PM
to pygr...@googlegroups.com
> Hi,
> I'm now going to send messages to the Google group address instead of
> individual emails...
>
> Titus and Namshin have raised some excellent questions about expanding
> the NLMSA interfaces for new ways of searching. These seem like great
> ideas, and should be fairly easy to provide. The real question is
> what we want. I'll try to summarize some possibilities, and give some
> background on what's easy vs. hard

OK, great! If you expect quick responses, though, you're going to have to
write shorter and dumber e-mails ;).

> Possible new interfaces:
> - joins: if you want to join two entire datasets, there's a huge
> performance gain for doing it inside pyrex / C as a single method call
> from Python (instead of using Python to iterate over one dataset as
> individual queries to the other dataset). My feeling is that this
> should be a core capability of NLMSA.

Yes please! Although see 'join' vs 'intersection' below.

> - iterator: "walk" the stored intervals / annotations in order on the
> chromosome. Again, this seems like it should be a core capability.

Yes please! Something like this?

for x in nlmsa[interval].after:
    ...

or

for x in nlmsa.after(interval):
    ...

and then for 'before' you could just do something like this,

for x in nlmsa.after(-interval):
    ...

right?

(You have a much better design sense for this kind of thing than I do so I'd
like to just try out whatever you suggest. Not that I won't dissent, of
course, but that's part of the fun!)

> - closely related issues like "find nearest neighbor" or "distance to
> nearest neighbor", all flow from the same capability.

Yep. Given the iterator, above, I think these might not even need to be
methods; you could just do

x = nlmsa.after(interval).next()

although I forget if there are situations in which that will raise a
StopIteration. Hmm, I think not, if a generator/iterator is returned by
'after'.
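For what it's worth, calling .next() on an exhausted iterator does raise StopIteration even for generators; passing a default to the built-in next() sidesteps that. A toy stand-in for the hypothetical nlmsa.after() interface:

```python
def after(intervals, pos):
    """Yield (start, stop) intervals with start >= pos, in start order.
    A stand-in for the proposed nlmsa.after() discussed above."""
    for iv in sorted(intervals):
        if iv[0] >= pos:
            yield iv

ivals = [(2, 4), (8, 10)]
nearest = next(after(ivals, 5), None)    # first interval at/after 5
nothing = next(after(ivals, 20), None)   # default instead of StopIteration
```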

> - different kinds of queries: Right now there is only one kind of
> query ("find overlaps"). Other possibilities: "find exact match",
> "find contained intervals", "find intervals containing", etc. the 13
> Allen interval logic relations. "find N neighbors".

Yep. I think being able to gather contained and containing intervals would be
useful in some circumstances; maybe we can mock out this stuff in Python to try
out the interface?

> - we could integrate filtering into querying: Right now, filtering is
> a NLMSASlice method, which means you do it AFTER querying. If the
> filter removes a huge volume of results, it clearly would be better to
> do that during querying. E.g. I only want to get results that map to
> a given target genome or region, etc.

Yep.

What about providing a callback mechanism for filtering, and maybe a wrapped
NLMSA object that implicitly applies the filter?

def filter_fn(interval, information):
    ...  # return True or False

filtered_nlmsa = nlmsa.filter_with(filter_fn)

# now filtered_nlmsa only provides key/value pairs for which
# filter_fn is True.
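The filter_with() idea above could be mocked up as a thin wrapper; any mapping from a query key to an iterable of hits stands in for the NLMSA here, while a real implementation would push the predicate down into the C query loop:

```python
class FilteredNLMSA:
    """Sketch of the proposed filter_with(): wrap a query interface so
    a predicate is applied to each hit as part of the lookup itself."""
    def __init__(self, nlmsa, predicate):
        self.nlmsa = nlmsa
        self.predicate = predicate

    def __getitem__(self, key):
        return [hit for hit in self.nlmsa[key]
                if self.predicate(key, hit)]
```

For example, wrapping a result set with a predicate that keeps only hg18 hits would implement the "only results mapping to a given target genome" case from Chris's list.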

> - dynamic (read-write) NLMSA: lets you add new alignments /
> annotations at any time. This could either be done using the existing
> storage (a new rebuild() method would minimize the time required by
> limiting the build just to the parts that changed), or creating a new
> storage that is truly tree-based.

Seems like an optimization that I don't need myself just yet, but if it's
relatively easy, then it would be great.

> - reduced memory requirement for building: I'm not sure if Namshin or
> Titus feel this is an important issue.

Ditto.

> - joins: the basic operation is to compose two maps A->B and B->C to
> obtain a map of A->C.
> Implementing joins is more work than the other features above. I'll
> think about exactly what's involved before trying to give a summary of
> the effort required.

Well, I would like *intersections* as well as joins, e.g. composition of two
maps X:A->B and Y:A->B such that I can quickly and easily get features x in X
such that x is also in Y.
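That intersection is cheap if both maps are sorted on the shared A coordinate; here is a merge-scan sketch over the source intervals only (ignoring the B side, which a real version would carry along):

```python
def intersect(x, y):
    """Return intervals of X whose source overlaps some interval of Y.
    Both inputs are (start, stop) lists sorted by start; the single
    forward-moving cursor keeps this O(n + m)."""
    out, j = [], 0
    for s, e in x:
        # skip Y intervals that end at or before this X interval starts
        while j < len(y) and y[j][1] <= s:
            j += 1
        if j < len(y) and y[j][0] < e:   # overlap found
            out.append((s, e))
    return out
```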

> Titus, can you clarify precisely what benefit you're hoping to get
> from altering build() behavior? Do you want to
> * have a better error message (if you think the current message is
> confusing)?
> * have NLMSA automagically do the build for the user, instead of
> raising an exception? (this could be accompanied by a warning message
> explaining the situation a la "you really should have called build()
> first, but we'll build it for you anyway... this may take a little
> while... in the future please call build() first. To hide this
> message, set verbose=False.")
> * something else?

First, one note -- printing stuff out to the user by default is IMO a no-no. I
would prefer to use either warnings or logging to communicate with the user.
This provides us with a simple way to turn messages on and off globally...
Personally I hate the logging module, but I must concede that it's not a bad
solution for this.

OK, back to the build() stuff. Right now, this gives a confusing
result:

>>> for v in annot_db.values():
...     annot_map.addAnnotation(v)
>>> print repr(annot_map[interval])
<pygr.nlmsa_utils.BuildMSASlice object at 0x84d13ec>

as opposed to what you get if you do a build first:

>>> annot_map.build()
>>> print repr(annot_map[interval])
<pygr.cnestedlist.NLMSASlice object at 0x84cf194>

This is confusing and Bad, because you can't do the same things to a
BuildMSASlice that you can to an NLMSASlice, and the resulting errors
aren't very informative.

One way of dealing with this (#1) is to return an error message:


>>> for v in annot_db.values():
...     annot_map.addAnnotation(v)
>>> print repr(annot_map[interval])
Traceback (most recent call last):
...
TypeError: call build() before indexing an NLMSA!


An alternative (#2) is to 'auto-build' as you propose.


Another alternative (#3) is to expose different Python interfaces to the NLMSA,
pre-build and post-build:

>>> for v in annot_db.values():
...     annot_map.addAnnotation(v)
>>> print repr(annot_map[interval])
Traceback (most recent call last):
...
TypeError: NLMSABuildContainers are unsubscriptable; call build() to get an NLMSA that you can query.

>>> built_map = annot_map.build()
>>> print repr(built_map[interval])


I think I like #1 the most, come to think of it, followed by #2.

The only real advantage to #3 is that you might be able to do an efficient job
of storing pre-build NLMSAs if you can use a different data structure,
pre-build. Not sure about the internals, though, and in any case you can
probably implement that underneath the current interface, anyway.

So in case that's not clear: let's just put an error message on
subscripting an unbuilt NLMSA :)

cheers,
--titus

Christopher Lee

Apr 30, 2008, 1:39:34 AM
to pygr...@googlegroups.com

On Apr 29, 2008, at 7:24 PM, C. Titus Brown wrote:

>
> OK, back to the build() stuff. Right now, this gives a confusing
> result:
>
>>>> for v in annot_db.values():
> ... annot_map.addAnnotation(v)
>>>> print repr(annot_map[interval])
> <pygr.nlmsa_utils.BuildMSASlice object at 0x84d13ec>
>
> as opposed to what you get if you do a build first:
>
>>>> annot_map.build()
>>>> print repr(annot_map[interval])
> <pygr.cnestedlist.NLMSASlice object at 0x84cf194>
>
> This is confusing and Bad, because you can't do the same things to a
> BuildMSASlice that you can to an NLMSASlice, and the resulting errors
> aren't very informative.

Hi Titus,
this is part and parcel of the whole concept of a NLMSA as a graph
interface. You add (write) alignments to an NLMSA via the same dict-
based interface that you use for querying (reading) alignments:
msa += ival1 # add ival1 as a node in the graph
msa[ival1][ival2] = None # add an edge from ival1 to ival2

It is this syntax that implies you can query using

for ival2 in msa[ival1]:
    do something...

Thus, when an NLMSA is being built, msa[ival1] is not only a legal
operation, but a crucial one. It returns a BuildMSASlice object that
does nothing but handle requests to add an edge such as illustrated
above. It certainly doesn't support query methods like NLMSASlice
does, since we don't support read-write NLMSA now.

I agree that BuildMSASlice gives no useful error messages. I think
the simple answer is to make it give appropriate error messages for
all the possible operations, e.g. if you try to iterate over a slice

for ival2 in msa[ival1]:
    do something

Traceback (most recent call last):
...

NLMSAError: you must build this NLMSA before attempting to search it,
by calling its build() method!

It may be that convenience methods like addAnnotation(),
readMAFfiles() and readAxtNet() are making it easy to forget that you
build an NLMSA almost exactly like a dictionary (well, a two-level
dictionary, to represent a graph), and in a way that is syntactically
consistent with how you query it... (again, just like a dictionary...)

With the two level dictionary interface as a graph representation, one
disadvantage is that both building (adding an edge) and querying
(searching for edges from an origin node) involve doing a "first-
level" slice...

More answers below...

>
>
> One way of dealing with this (#1) is to return an error message:
>
>
>>>> for v in annot_db.values():
> ... annot_map.addAnnotation(v)
>>>> print repr(annot_map[interval])
> Traceback (most recent call last):
> ...
> TypeError: call build() before indexing an NLMSA!

but this is not illegal; see above


>
>
>
> An alternative (#2) is to 'auto-build' as you propose.
>
>
> Another alternative (#3) is to expose different Python interfaces to
> the NLMSA,
> pre-build and post-build:
>
>>>> for v in annot_db.values():
> ... annot_map.addAnnotation(v)
>>>> print repr(annot_map[interval])
> Traceback (most recent call last):
> ...
> TypeError: NLMSABuildContainers are unsubscriptable; call build() to
> get an NLMSA that you can query.
>
>>>> built_map = annot_map.build()
>>>> print repr(built_map[interval])
>
>
> I think I like #1 the most, come to think of it, followed by #2.
>
> The only real advantage to #3 is that you might be able to do an
> efficient job
> of storing pre-build NLMSAs if you can use a different data structure,
> pre-build. Not sure about the internals, though, and in any case
> you can
> probably implement that underneath the current interface, anyway.

yes, that is what we do already. Pre-build NLMSAs just store the list
of alignment intervals without any indexing (which is only created by
the build() method). The readMAFfiles() and readAxtNet() methods just
stream the alignment interval list to disk. Only the build()
operation forces some of the data to be read into memory (the data for
constructing one index at a time).

Namshin Kim

May 3, 2008, 5:11:39 AM
to pygr...@googlegroups.com
Hi Chris,
 
I think I should summarize what I have in mind. Mostly, I want find_nearest_neighbor and walk operations on annotation databases, but they could easily be extended to the general NLMSA interface.
 
According to Alex's NCList paper, a query first finds the top-level intervals and then iterates over the sub-intervals contained in them. Because many intervals (annotations) can overlap, we may not get good performance if we search and iterate over all the sub-intervals (again). And there is another question of whether we should return only the "closest" hit or all hits. If the next annotations share the same "start" position, we should in general return them all.
 
But because top-level intervals (which contain plenty of sub-intervals) do not overlap each other, we can just return the next start position (the top-level NLMSA.start has an NLMSA.stop, right? Or it could be the distance to the next top-level interval). Then we can use the standard pygr iteration operation to retrieve all sub-intervals within them. The first and last top-level intervals can take a -1 value, meaning there are no further intervals. We can add a walk operation easily by just wrapping the find_nearest_neighbor operation.
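Since top-level intervals never overlap, chaining them after a sort is a one-pass job. A dict-based sketch of the pointers described above, with -1 as the end sentinel (records would be NLMSA index entries in the real thing):

```python
def link_top_levels(tops):
    """Sort non-overlapping top-level intervals by start and give each
    record the start of its previous/next neighbor, or -1 at the ends."""
    tops = sorted(tops, key=lambda t: t['start'])
    for i, t in enumerate(tops):
        t['prev_start'] = tops[i - 1]['start'] if i > 0 else -1
        t['next_start'] = (tops[i + 1]['start']
                           if i < len(tops) - 1 else -1)
    return tops
```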
 
With the current version of pygr, a user can do all of the above on an annotation database simply by adding several attributes to the annotation objects, such as Annot.cluster_id (as a top-level interval), Annot.cluster_start/stop, Annot.cluster_previous_start/stop, and Annot.cluster_next_start/stop. But that requires pre-processing of the intervals. It would be much better if pygr supported this directly.
 
Yours,
Namshin Kim