pygr development plan

Christopher Lee

Oct 19, 2009, 5:04:16 PM
to Pygr Development Group
Hi,
Now that Pygr 0.8 is released, this is a good time to take stock of
where we are and where we want to go next. I would like to propose
the following topics for a high-level discussion on some changes to
Pygr development. I also propose that we discuss these at our next
developer conference call. Meanwhile, chime in with your thoughts
about this here in the Discussion Group.

1. Snipping away Pygr's "dark corners"
Software tends to follow an approximate 80-20 rule: a small fraction
of the code and features gets most of the usage and therefore most of
the attention during bug fixing and documentation. The rest of the
code is like "dark corners" where almost no one ever goes, and no
light ever shines. It's typically not well-documented and not well
tested; the bottom line is that nobody is completely sure what's
lurking down there. Pygr has plenty of dark corners, just like any
other project, and that should come as no surprise, since I originally
advertised it as a "rapid prototyping" project.

However, Pygr is maturing, and it's time to move all these "dark
corners" out of the master branch, into deprecated or experimental
branches. This is a more limited form of code review than Titus
carried out for seqdb and blast, which involved heavy-duty
refactoring. Here I am just talking about relabeling a fraction of
the codebase as "experimental". Thanks to git, the small fraction of
users who really want those features will still be able to get at them
easily, but the master branch will benefit from having the messy,
half-baked code cleaned out.

2. Pygr XB ("eXperimental Branches")
We've already decided to switch to a rapid "point release" schedule in
which we'll release smaller increments of new features than the "big
gulp" releases of 0.7 and 0.8. I want to make clear how that should
actually work: each of us can initiate an "experimental branch" that
solves a problem we have; once it's done, reviewed and documented, it
can become a point release. We can each be working on different
experimental branches in parallel, using git / github to collaborate
on a given experimental branch; as each matures, it turns into a point
release. Some examples that immediately come to my mind:

- integrating Titus' fast sequence database for short reads;

- currently, MySQLMetabase is tied to MySQL. It should be changed to
just use our standard SQLTable interface, which would make it work
with sqlite (and whatever other databases SQLTable supports).
Furthermore, we could then switch the default for local file metabase
storage to use SQLMetabase with sqlite, eliminating our dependency on
shelve (which has caused so many problems already, and is likely to
become even more of a problem due to the elimination of bsddb from the
Python Standard Library in Python 2.7, Python 3.0 etc.). Storing
metabases in a real database will greatly facilitate development of
powerful worldbase management interfaces in the future (a rough sketch
of the sqlite idea follows this list).

- make SQLTable work with SQLAlchemy. Paul Rigor started work on this
already...

- ensembl annotation support via UCSC databases. Jenny and Marek are
already investigating this.

- more tutorials as people have requested! I prefer to view this as
an experimental branch leading to a point release, since it is
possible that working on tutorial examples may involve bug fixes and
other code changes.
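
To give a flavor of the sqlite metabase idea above, here is the kind of
shelve replacement I have in mind. This is only a rough sketch: the
class name, table name and layout below are made up, and a real
implementation would presumably go through SQLTable rather than raw
sqlite3 calls.

---
import sqlite3
try:
    import cPickle as pickle    # Python 2
except ImportError:
    import pickle               # Python 3

class SQLiteShelf(object):
    """Shelve-style persistent mapping backed by a single sqlite3 file
    (illustrative only; not an actual Pygr interface)."""
    def __init__(self, filename):
        self.db = sqlite3.connect(filename)
        self.db.execute('CREATE TABLE IF NOT EXISTS metabase '
                        '(k TEXT PRIMARY KEY, v BLOB)')

    def __setitem__(self, key, value):
        blob = sqlite3.Binary(pickle.dumps(value, 2))  # pickle, like shelve
        self.db.execute('REPLACE INTO metabase (k, v) VALUES (?, ?)',
                        (key, blob))
        self.db.commit()

    def __getitem__(self, key):
        row = self.db.execute('SELECT v FROM metabase WHERE k = ?',
                              (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(bytes(row[0]))
---

Usage would look just like shelve: open by filename, assign objects by
key, and read them back in a later session.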

These are just examples. I'm sure you have your own ideas for what
new capabilities you'd like to see.

-- Chris

C. Titus Brown

Oct 26, 2009, 12:14:54 AM
to pygr...@googlegroups.com
On Mon, Oct 19, 2009 at 02:04:16PM -0700, Christopher Lee wrote:
> Now that Pygr 0.8 is released, this is a good time to take stock of
> where we are and where we want to go next. I would like to propose
> the following topics for a high-level discussion on some changes to
> Pygr development. I also propose that we discuss these at our next
> developer conference call. Meanwhile, chime in with your thoughts
> about this here in the Discussion Group.

[ ... ]

> 1. Snipping away Pygr's "dark corners"

[ ... ]

> 2. Pygr XB ("eXperimental Branches")

I really like the idea of figuring out how to effectively have multiple
branches, or a faster evolving subset in the code base, or ...

My main specific interest at this point is generating a set of tutorials
and benchmarking code for next-gen sequence analysis. I think a lot
of the basic needs are fairly clear, and I hope to release code over the
next month to address some of them, but the two big issues on my mind
are these:

first, the memory consumption "bug" while building NLMSAs with lots of
sequences is really problematic. I need to track it down and terminate it with
extreme prejudice. A quick-and-easy reproducible bug would be a good start
here, sigh.

second, I have yet to come up with a mentally workable module for
paired-end representation. Chris, can your graph-fu be applied to
representing paired-end reads mapped onto an existing genome?

thanks,
--titus
--
C. Titus Brown, c...@msu.edu

Christopher Lee

Oct 26, 2009, 3:31:19 PM
to pygr...@googlegroups.com

On Oct 25, 2009, at 9:14 PM, C. Titus Brown wrote:

> first, the memory consumption "bug" while building NLMSAs with lots of
> sequences is really problematic. I need to track it down and terminate
> it with extreme prejudice. A quick-and-easy reproducible bug would be
> a good start here, sigh.

based on your description, it sounds like this should be easy to track
down:
- it is just in the NLMSA building code, which is one small part of
the NLMSA code;
- it is specifically associated with the number of sequences. Only a
few data structures correlate with the number of sequences.

I think we may be able to find this by just looking at the code...
Just to make sure, you're talking about a memory leak, where the
completion of the construction phase fails to free some memory?

>
> second, I have yet to come up with a mentally workable module for
> paired-end representation. Chris, can your graph-fu be applied to
> representing paired-end reads mapped onto an existing genome?

Let me see if I understand what you want. You have pairs of sequences
that you expect to map some approximate distance from each other
(possibly overlapping). If the sequences were non-overlapping, you
could in principle just concatenate them, and then NLMSA's existing
group-by operators would allow you to filter for mappings of the
concatenated pair that are within a certain distance of each other
(maxinsert option, I think). That would give you what you want.
Presumably your real concern is how to scale this up to a single, fast
join operation that will find all pairs that map within a certain
distance of each other, from the entire dataset. Is that right? For
sequence pairs that do overlap (and which you therefore want to store
as two separate sequences), it seems like the query for a single pair
is pretty straightforward, and again the scalability challenge is how
to turn this into a fast join operation over the whole dataset. Is
that your question?
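
Just to make the distance-filtering idea concrete, here is the logic in
plain Python. This is not NLMSA code or its actual API; the function
name and the (target, start, stop) tuples are made up purely for
illustration.

---
def pairs_within_insert(hits1, hits2, max_insert):
    """Return (hit1, hit2) combinations that map to the same target
    sequence within max_insert bp of each other."""
    kept = []
    for target1, start1, stop1 in hits1:
        for target2, start2, stop2 in hits2:
            if target1 != target2:
                continue
            span = max(stop1, stop2) - min(start1, start2)
            if span <= max_insert:
                kept.append(((target1, start1, stop1),
                             (target2, start2, stop2)))
    return kept
---

The all-against-all loop is of course exactly the part you would want
to replace with a single fast join over the whole dataset.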

-- Chris

C. Titus Brown

Oct 26, 2009, 11:39:11 PM
to pygr...@googlegroups.com
On Mon, Oct 26, 2009 at 12:31:19PM -0700, Christopher Lee wrote:
> On Oct 25, 2009, at 9:14 PM, C. Titus Brown wrote:
>
> > first, the memory consumption "bug" while building NLMSAs with lots of
> > sequences is really problematic. I need to track it down and terminate
> > it with extreme prejudice. A quick-and-easy reproducible bug would be
> > a good start here, sigh.
>
> based on your description, it sounds like this should be easy to track
> down:
> - it is just in the NLMSA building code, which is one small part of
> the NLMSA code;
> - it is specifically associated with the number of sequences. Only a
> few data structures correlate with the number of sequences.
>
> I think we may be able to find this by just looking at the code...
> Just to make sure, you're talking about a memory leak, where the
> completion of the construction phase fails to free some memory?

Yep. After some judicious whacking of caches and valgrinding to find out what
*type* of object was growing without bound -- a dict -- it turns out to have
been a problem in nlmsa_utils.py, line 140:

if not hasattr(seq,'annotationType'): # don't cache annotations
    dict.__setitem__(self, seq.pathForward, v) # cache this result

This adds seq.pathForward into the nlmsaSeqDict internal cache, which never
expires.

This is fine for small numbers of sequences, but when you've got 2.5 mn small
reads, well, that dictionary gets big ;). We should add a megatest for
short reads, I think.

Simply disabling that line lets me map 2.5 mn short reads in 14 mb of memory,
as opposed to the > 10 gb that it was taking before. Much faster without
the swapping!

I'm not sure if this is a good fix, but I think a default limit of 10k
or 100k entries would be good -- my memory usage stays under 200 mb for
100k references, which seems appropriate, and we can make it
configurable for people who need to work with more than 100k sequences
in NLMSAs.
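
Roughly what I have in mind, as a sketch (hypothetical names, not the
actual nlmsaSeqDict code):

---
class BoundedCache(dict):
    """Cache that simply stops adding entries once maxEntries is
    reached (sketch only)."""
    def __init__(self, maxEntries=100000):
        dict.__init__(self)
        self.maxEntries = maxEntries

    def cache_item(self, key, value):
        if len(self) < self.maxEntries:   # past the limit, skip caching
            dict.__setitem__(self, key, value)
---

A fancier version could evict old entries instead of refusing new ones,
but even the dumb version keeps memory bounded.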

Comments or thoughts? I volunteer to implement the fix for 8.1, esp if I can
refactor to use DictMixin and ListMixin for the two classes at the top of
nlmsa_utils...

cheers,

C. Titus Brown

Oct 26, 2009, 11:44:08 PM
to pygr...@googlegroups.com
On Mon, Oct 26, 2009 at 12:31:19PM -0700, Christopher Lee wrote:
> > second, I have yet to come up with a mentally workable module for
> > paired-end representation. Chris, can your graph-fu be applied to
> > representing paired-end reads mapped onto an existing genome?
>
> Let me see if I understand what you want. You have pairs of sequences
> that you expect to map some approximate distance from each other
> (possibly overlapping). If the sequences were non-overlapping, you
> could in principle just concatenate them, and then NLMSA's existing
> group-by operators would allow you to filter for mappings of the
> concatenated pair that are within a certain distance of each other
> (maxinsert option, I think). That would give you what you want.
> Presumably your real concern is how to scale this up to a single, fast
> join operation that will find all pairs that map within a certain
> distance of each other, from the entire dataset. Is that right? For
> sequence pairs that do overlap (and which you therefore want to store
> as two separate sequences), it seems like the query for a single pair
> is pretty straightforward, and again the scalability challenge is how
> to turn this into a fast join operation over the whole dataset. Is
> that your question?

I'm not completely sure of how I want to use paired-end reads, but PE
short-read technology is extremely useful for indel analysis, which in
turn is going to be very important.

I had already thought of the concatenation solution, and that should work
for finding pairs that are consonant with each other. How about
pairs representing inversions, where one read of the pair points in a
different direction from the other? I guess you could do the same thing
and simply keep a separate list of situations where the pairs map
in the same direction... hmm.
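
Sketching the bookkeeping I mean (strand values as +1/-1, names made
up):

---
def classify_pair(strand1, strand2):
    """Crude orientation check for one mapped read pair: opposite
    strands look like a normal pair, same strand is a candidate
    inversion signal to keep on that separate list."""
    if strand1 != strand2:
        return 'consonant'
    return 'possible inversion'
---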

Anyway, "fast" is also going to be very important, given the volume of data
produced by individual groups.

cheers,

Christopher Lee

Oct 26, 2009, 11:54:37 PM
to pygr...@googlegroups.com

On Oct 26, 2009, at 8:39 PM, C. Titus Brown wrote:

>
> On Mon, Oct 26, 2009 at 12:31:19PM -0700, Christopher Lee wrote:
> Yep. After some judicious whacking of caches and valgrinding to find
> out what *type* of object was growing without bound -- a dict -- it
> turns out to have been a problem in nlmsa_utils.py, line 140:
>
> if not hasattr(seq,'annotationType'): # don't cache annotations
>     dict.__setitem__(self, seq.pathForward, v) # cache this result
>
> This adds seq.pathForward into the nlmsaSeqDict internal cache, which
> never expires.

Awesome! This also makes me feel better. I thought you meant a
bona-fide memory leak in my C code where some memory is no longer
accessible but has not been freed.
not a memory leak, but just a caching policy that is not appropriate
for your specific usage. Indeed, the if statement above seems to be
begging to be generalized, so the user can turn off or limit the
caching if they want.

>
> This is fine for small numbers of sequences, but when you've got 2.5
> mn small reads, well, that dictionary gets big ;). We should add a
> megatest for short reads, I think.
>
> Simply disabling that line lets me map 2.5 mn short reads in 14 mb of
> memory, as opposed to the > 10 gb that it was taking before. Much
> faster without the swapping!

Great.

>
> I'm not sure if this is a good fix, but I think a default limit of 10k
> or 100k entries would be good -- my memory usage stays under 200 mb
> for 100k references, which seems appropriate, and we can make it
> configurable for people who need to work with more than 100k sequences
> in NLMSAs.
>
> Comments or thoughts? I volunteer to implement the fix for 8.1, esp if
> I can refactor to use DictMixin and ListMixin for the two classes at
> the top of nlmsa_utils...

I think the main question is simply to find an appropriate 80-20 rule
for your application, i.e. to have the caching achieve 80% of the
benefits of full caching with only 20% of the space (or whatever)...

-- Chris


C. Titus Brown

Oct 27, 2009, 11:58:48 AM
to pygr...@googlegroups.com
On Mon, Oct 26, 2009 at 08:54:37PM -0700, Christopher Lee wrote:
> On Oct 26, 2009, at 8:39 PM, C. Titus Brown wrote:
>
> >
> > On Mon, Oct 26, 2009 at 12:31:19PM -0700, Christopher Lee wrote:
> > Yep. After some judicious whacking of caches and valgrinding to find
> > out what *type* of object was growing without bound -- a dict -- it
> > turns out to have been a problem in nlmsa_utils.py, line 140:
> >
> > if not hasattr(seq,'annotationType'): # don't cache annotations
> >     dict.__setitem__(self, seq.pathForward, v) # cache this result
> >
> > This adds seq.pathForward into the nlmsaSeqDict internal cache, which
> > never expires.
>
> Awesome! This also makes me feel better. I thought you meant a
> bona-fide memory leak in my C code where some memory is no longer
> accessible but has not been freed. The problem you have identified is
> not a memory leak, but just a caching policy that is not appropriate
> for your specific usage. Indeed, the if statement above seems to be
> begging to be generalized, so the user can turn off or limit the
> caching if they want.

Potahto, potaeto -- memory hidden in an inconvenient place, then? ;)

For future reference, valgrind performed very nicely: it told me that
the memory was still accessible, and that it belonged to a dict. This
in turn told me that it was unlikely to be a "true" memory leak, but
rather memory still held by something that wasn't being garbage collected.

So I disabled all of the "cache" dictionaries. That didn't help.

I then looked at 'gc.garbage' to figure out whether or not it was an
uncollectable cycle -- and it wasn't. At that point I just had to look at
dictionary after dictionary...
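
(For the record, the cycle check itself was trivial, roughly this:)

---
import gc

gc.collect()              # force a full collection first
print(len(gc.garbage))    # uncollectable objects land here; it stayed empty
---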

I also used the following code to examine memory stats on Linux at the
end of the script:

---
import os

pid = os.getpid()
fp = open('/proc/%s/status' % pid)
for line in fp:
    if line.startswith('Vm'):   # VmPeak, VmSize, VmRSS, etc.
        print line,
---

This might be a good thing to report for megatests.
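
For example, a megatest could call a small helper along these lines
(hypothetical name, Linux-only) and log the result at the end of each
run:

---
import os

def get_vm_stats(pid=None):
    """Return the Vm* fields from /proc/<pid>/status as a dict."""
    if pid is None:
        pid = os.getpid()
    stats = {}
    for line in open('/proc/%d/status' % pid):
        if line.startswith('Vm'):
            key, value = line.split(':', 1)
            stats[key] = value.strip()
    return stats
---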

> > I'm not sure if this is a good fix, but I think a default limit of
> > 10k or 100k entries would be good -- my memory usage stays under 200
> > mb for 100k references, which seems appropriate, and we can make it
> > configurable for people who need to work with more than 100k
> > sequences in NLMSAs.
> >
> > Comments or thoughts? I volunteer to implement the fix for 8.1, esp
> > if I can refactor to use DictMixin and ListMixin for the two classes
> > at the top of nlmsa_utils...
>
> I think the main question is simply to find an appropriate 80-20 rule
> for your application, i.e. to have the caching achieve 80% of the
> benefits of full caching with only 20% of the space (or whatever)...

I'm more interested in choosing a good default so that this doesn't
bite anyone else! Does 100k sound good?

We can also provide a simple configuration mechanism & document it, so
people are aware of where memory can go astray.
