iterator access to NLMSA intervals


Alexander Alekseyenko

Mar 11, 2009, 11:57:12 AM
to pygr-dev
Dear All,

I couldn't find this documented, but is there a way to get iterator
access to the matchIntervals of an NLMSA slice? Also, is there a way to
make an NLMSA slice just hold an iterator into the NLMSA query, which
it may execute later as needed? The reason I'm asking is that when
working with large alignments (like mappings of next-gen sequencing
datasets), a query on the reference causes NLMSA to go in and take out
ALL the matches before returning. This is a usability concern, since
response time is very slow. I would prefer to get an iterator into the
results and have my downstream processing work with that, so that I can
see what results I'm getting as early as possible. Any ideas on how to
achieve this behavior?
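
Here is roughly the access pattern I mean, with a plain generator
standing in for the NLMSA query (match_intervals below is a made-up
stand-in for illustration, not a real pygr call):

```python
from itertools import islice

def match_intervals(start, stop):
    # Hypothetical stand-in for a lazy NLMSA interval query: yields
    # (start, stop) pairs one at a time instead of materializing the
    # full result set up front.
    for s in range(start, stop, 10):
        yield (s, s + 10)

# With iterator access, downstream code sees the first hits immediately
# and can stop early without paying for the whole result set.
first_five = list(islice(match_intervals(0, 1_000_000), 5))
print(first_five)
```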

Cheers,
Alex

Christopher Lee

Mar 12, 2009, 5:17:29 PM
to pygr...@googlegroups.com
Hi Alex,
I am reading between the lines of your question; let me know if I'm
misunderstanding. If you have a single query that generates a huge
result set, regardless of whether it gets the results in one step or
via an iterator, getting them all will be slow. But if you intend to
join that query with another query, such that their intersection would
be a much smaller final result set, then it could be implemented much
faster than our current mechanism (which fetches the entire first
result set before doing anything further). Specifically, the two
query results can be joined via a linear scan, which could be really
fast in C.
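
To sketch what I mean (toy Python, not our C internals; assumes two
lists of (start, stop) tuples, each sorted by start position):

```python
def join_sorted_intervals(a, b):
    # Linear-scan join of two start-sorted interval lists: yield every
    # overlapping pair in O(n + m) plus output size, so neither result
    # set has to be fully materialized before joining.
    i = j = 0
    while i < len(a) and j < len(b):
        a_start, a_stop = a[i]
        b_start, b_stop = b[j]
        if a_start < b_stop and b_start < a_stop:  # intervals overlap
            yield (a[i], b[j])
        # advance whichever interval ends first
        if a_stop <= b_stop:
            i += 1
        else:
            j += 1

left = [(0, 5), (10, 20), (30, 40)]
right = [(3, 12), (35, 50)]
pairs = list(join_sorted_intervals(left, right))
print(pairs)
```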

So it sounds like what you really want is true JOIN operators (joins
across two or more queries), and that is exactly what we are starting
to work on next. It would be great if you're interested in helping
define exactly which fast JOIN operators are needed first.

Yours,

Chris

Alexander Alekseyenko

Mar 12, 2009, 6:21:04 PM
to pygr-dev
Thanks Chris.
What you describe would be nice to have. However, I have a much simpler
problem in mind. Suppose we have a server that connects to an NLMSA
database to retrieve results for a requesting client, and imagine the
server returns an enormous result set. If we first fetch the whole
result set from the database before serving, we are likely to time out
on the connection with the requesting client. If instead we get an
iterator into the result set, we can start serving the client
immediately, with a much smaller chance of timing out.

Here's how this helps in my case. Suppose you are doing exploratory
analysis and need to quickly check whether an idea will work. You run
a query against the database, and to confirm the idea you only need a
small portion of the result set. In other words, what I need is a
LIMIT modifier on a query. Having iterator access makes limiting
trivial; hence my question.
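
To illustrate: given any lazy result iterator, a LIMIT is just
itertools.islice (a generic sketch, nothing pygr-specific):

```python
from itertools import islice

def limited(results, limit):
    # SQL-style LIMIT: consume at most `limit` items from any
    # iterable, never materializing the full result set.
    return islice(results, limit)

# An "enormous" lazy result set; only the first 3 items ever get computed.
huge_query = ((i, i + 10) for i in range(10**9))
preview = list(limited(huge_query, 3))
print(preview)
```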

As for the JOIN, I think it deserves a whole new discussion thread.
Let's start it!

Cheers,

Alex

Christopher Lee

Mar 12, 2009, 6:58:14 PM
to pygr...@googlegroups.com
Answers below...

On Mar 12, 2009, at 3:21 PM, Alexander Alekseyenko wrote:

>
> Thanks Chris.
> What you describe would be nice to have. However, I have a much simpler
> problem in mind. Suppose we have a server that connects to an NLMSA
> database to retrieve results for a requesting client, and imagine the
> server returns an enormous result set. If we first fetch the whole
> result set from the database before serving, we are likely to time out
> on the connection with the requesting client. If instead we get an
> iterator into the result set, we can start serving the client
> immediately, with a much smaller chance of timing out.

It sounds like you're talking about an XMLRPC client-server
connection. The use of an iterator in this context is a little funny,
because XMLRPC is not designed for the sort of persistent connection
that an iterator would require. It could be done, but we'd have to
define some very clear rules for letting the server purge old
iterators that clients have not properly released. More problematic
is the issue of multiple concurrent queries. Currently the NLMSA
server completes a query before beginning another. In your
"persistent iterator" model, the server has to hold incomplete
iterators open while continuing to process new queries. The internals
of intervaldb are in fact iterator-based, so it could do that, but it
certainly would add complexity to the server code.

It would be a LOT easier to do a straight LIMIT query just like SQL
does (see below). I suspect the above complexities may be why SQL
provides only a LIMIT mechanism, and no "progressive results" iterator
like you are asking for.

>
>
> Here's how this helps in my case. Suppose you are doing exploratory
> analysis and need to quickly check whether an idea will work. You run
> a query against the database, and to confirm the idea you only need a
> small portion of the result set. In other words, what I need is a
> LIMIT modifier on a query. Having iterator access makes limiting
> trivial; hence my question.

We could implement a LIMIT clause very easily, right away, for both
local and XMLRPC queries. That seems like a great idea.

Alexander Alekseyenko

Mar 12, 2009, 8:11:03 PM
to pygr-dev
I wasn't actually referring to XMLRPC client-server specifically (I
don't see an immediate utility in "persistent iterators"). Rather, I'm
concerned that the current setup is contrary to the general pygr
ideology of delayed execution. This wasn't an issue before I started
working with large datasets (next-gen sequencing). Here I see a
dramatic degradation in the _responsiveness_ of NLMSA: I have to sit
and wait about 2 minutes before any results start to emerge, because
they all have to be collected before I see any. This is why I'd like
to have an iterator.
This may also apply to XMLRPC. If you are serving over a slow
connection, the client waits for the query to complete and then waits
for the data transfer over the network. With an iterator, you can send
results as they emerge using non-blocking network writes, generally
reducing the time the client waits for results thanks to pipelining.
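
The serving side could be as simple as batching a lazy result stream
so each batch is written out as soon as it fills (a generic sketch,
nothing pygr- or network-specific):

```python
def stream_in_batches(results, batch_size=1000):
    # Group a lazy result stream into batches so a server can send
    # each batch to the client as soon as it is full, instead of
    # waiting for the entire query to finish (pipelining query
    # execution with data transfer).
    batch = []
    for item in results:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# The client starts receiving after the first 1000 results, not after all 2500.
batches = list(stream_in_batches(range(2500), batch_size=1000))
print([len(b) for b in batches])
```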

Christopher Lee

Mar 12, 2009, 9:51:50 PM
to pygr...@googlegroups.com
How large are the result sets that are taking so long, in terms of
numbers of intervals?

-- Chris

Alexander Alekseyenko

Mar 12, 2009, 9:59:33 PM
to pygr-dev
A few hundred thousand.

Christopher Lee

Mar 13, 2009, 12:32:08 AM
to pygr...@googlegroups.com
I don't think it should take 2 minutes to load that much data: 300,000
intervals * 24 bytes = about 7 MB, which can be read in a fraction of
a second. Is it possible that your system is going into virtual-memory
swapping or some other very slow state? Otherwise, we need a
reproducible test case for your performance problem so we can debug it.
-- Chris

On Mar 12, 2009, at 6:59 PM, Alexander Alekseyenko wrote:

>
> A few hundred thousand.

Alexander Alekseyenko

Mar 13, 2009, 2:12:27 PM
to pygr-dev
You are right: the actual query is fast (~6 sec.), but getting all the
intervals using NLMSASlice.matchIntervals() takes over 2 minutes. I
guess this call also includes all the necessary Python-level object
creation, which can be slow.
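
The per-result object-creation cost is easy to see in isolation with a
toy Interval class (nothing pygr-specific; timings vary by machine):

```python
import time

class Interval:
    # Minimal stand-in for a per-match Python object.
    __slots__ = ("start", "stop")
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

n = 300_000
t0 = time.perf_counter()
raw = [(i, i + 1) for i in range(n)]           # plain tuples: cheap
t1 = time.perf_counter()
objs = [Interval(i, i + 1) for i in range(n)]  # one object per match
t2 = time.perf_counter()
print(f"tuples: {t1 - t0:.3f}s  objects: {t2 - t1:.3f}s")
```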

Kenny Daily

Mar 13, 2009, 7:04:12 PM
to pygr-dev
I actually have a similar problem of slow loading times for a large
number of intervals in an NLMSA (~10^6). The sample code and input
file are here (beware, the file is 26 MB):

http://www.ics.uci.edu/~baldig/pygr_nlmsa_test.tar.gz

The input file (bowtie_mapped_reads.txt) is 1 million reads from a
bowtie mapping. All the code necessary for parsing, reading, and
inserting the data into pygr is in one file (pygr_nlmsa_test.py).
Using the "time" module, I get a loading time from
pygr.Data.getResource(...) of ~50 sec. Looking at usage with "top"
shows:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  TIME  DATA COMMAND
32134 kdaily 25  0  361m 268m 4336 S    0  1.7 1:29.49  1:29  264m ipython
...
Mem:  16475040k total,  1745548k used, 14729492k free,   70924k buffers
Swap: 65537156k total,     6812k used, 65530344k free, 1126804k cached

Also, any hints on better ways of implementing this kind of
functionality are appreciated!

Thanks,

Kenny

Christopher Lee

Mar 16, 2009, 5:02:20 PM
to pygr...@googlegroups.com
Thanks, Alex, that clarifies the problem tremendously. It would be
much easier to give you true iterator access to the NLMSASlice
results (than it would have been to hold the disk-access iterators
open). For match_intervals it's pretty trivial. By the way, you can
just take the len() of an NLMSASlice; I think this returns the number
of aligned sequences.

-- Chris
