First pass seqdb review FINISHED

C. Titus Brown

unread,

Mar 14, 2009, 9:14:36 PM3/14/09

to pygr...@googlegroups.com

Hi all,

I'm done with my first pass through seqdb.py. The results are available
on github; as before, I changed only seqdb.py and seqdb_test.py.

---

You can see seqdb.py here,

http://github.com/ctb/pygr/blob/6f73203f6da3d85e24a958911765c07a896ce1de/pygr/seqdb.py

and the new test diff here,

http://github.com/ctb/pygr/commit/f4fc1d59025859d33b79c4b160697c645215059e#diff-3

---

I didn't find any more bugs -- darn it ;).

These changes increased the code coverage by about 20% to 85%, which is
getting pretty respectable!

Review comments follow. Comments solicited; where I suggest backwards
incompatible changes, I don't see any likely breakages within pygr
itself, but I don't know what would happen within existing software
that uses pygr, so please let me know if there's a good reason to
keep things the way they are.

(If you want to just browse the source file, look for '@CTB' comments.)

After dealing with the issues below, and maybe figuring out how to test
the caching, I'll start writing up more documentation.

General comments
----------------

The class names in this file still confuse me. Do we have any rules
on how to build names for classes in inheritance hierarcy? e.g.

FileDBSequence and SequenceFileDB -- which one is the sequence db,
which one is the sequence??

The invert behavior is confusing (__invert__ points to
classutil.standard_invert which then redirects to self._inverse,
creating it from self._inverseClass if necessary). Can we use
descriptors or something "simpler" here?

Have we decided anything about logging, as Istvan established in his
branch? I'd like to get rid of explicit writes to sys.stderr, in
particular.

I'd like to establish some documentation or tests on exactly what
db.seqInfoDict should do. So far I know that it should contain info
objects with a length attribute, for NLMSAs to use; and an id and
sequence, presumably. What else? Ultimately it'd be good to
establish a simple executable test for "proper" seqInfoDict behavior.

Specific issues
---------------

I don't like some aspects of the 'ifile' handling in SequenceFileDB.
In particular, I dislike the part where it re-opens the file to
re-read it and then closes 'ifile'; this may not be expected behavior
('ifile' itself is only used to grab the filename!) and I don't like
closing user-passed-in file handles. Plus, pulling the filename out
of a file handle disallows StringIO and other "file-like" handles such
as sockets... Is the ifile mechanism actually used anywhere?

The 'set_seqtype' function on SequenceDB depends on things being a FASTA file
instead of just iterating. Why?? Can we remove that?

Also, 'set_seqtype' returns the sequence type as well as setting it on
the db object. This behavior isn't used anywhere in the code base. Can
we trash it?

In fact, it seems like 'set_seqtype' is used only within SequenceDB
and its children. Why not rename it to be private, i.e. prepend it
with '_'?

Should SequenceDB._cacheMax be configurable? Right now it's hardcoded.

I don't know what idFilter does in _store_seqlen_dict, and it seems
like it would be easier to just implement that in your own
seqfmt-based reader, anyway. Can we trash it? It's not used anywhere
in the code base.

The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be
created directly by a user, right? Should we make them private by
prepending a '_'?

SeqPrefixUnionDict is oddly named.

PrefixUnionDict could contain __iadd__ behavior (instead of
SeqPrefixUnionDict, which inherits from it solely to add __iadd__
behavior). Why not do things this way?

There's an 'if' statement in SeqPrefixUnionDict.__iadd__ that can
never be reached, as far as I can tell. Could someone (Chris) please
verify that for me?

In two places (SeqPrefixUnionDict.__iadd__, find @CTB untested; and
_PrefixUnionDictInverse.__getitem__, find @CTB abstraction) explicit
knowledge of annotation objects is assumed in classes that otherwise
don't know anything special about them. Any way to deal with this
better or should I just suck up and deal & write some tests?

A few PrefixUnionDict issues:

- PrefixUnionDict.__contains__ can take either a sequence ID or a
sequence object. This is inconsistent with __getitem__ (which only
takes seq IDs). Ugly.

- PrefixUnionDict.__init__ takes an argument 'trypath' that is really a
list of paths to try. Could we rename this to 'trypaths'? Or should
I just suck up and deal & write a nice doc?

- PrefixUnionDict.writeHeaderFile function (etc.) isn't used anywhere
in the code. How useful is it?

- PrefixUnionDict.newMemberDict(...) and the associated
_PrefixUnionDictMember class don't have a clear purpose to me.
Could someone give me a concrete use case? If not, maybe they should
be removed or something; they're confusing!

- _PrefixUnionDictMember.default handling could probably be written
in a more standard way using dict.setdefault, no? How much code depends
on this specific behavior?

The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and
unused in the rest of the code base. I'd appreciate some guidance on
what they do, and perhaps some tests to get started. (@CTB issue)

The SQLSequence stuff at the bottom of seqdb.py could probably go in a
different file, like sqlgraph.py. They don't depend on anything in
seqdb.py.

cheers,
--titus
--
C. Titus Brown, c...@msu.edu

Christopher Lee

unread,

Mar 18, 2009, 1:51:36 PM3/18/09

to pygr...@googlegroups.com

Thanks for all your detailed analysis, Titus! Comments below...

On Mar 14, 2009, at 6:14 PM, C. Titus Brown wrote:

> I didn't find any more bugs -- darn it ;).

Just because you're not paranoid doesn't mean they aren't out to ...
whatever.

>
>
> These changes increased the code coverage by about 20% to 85%, which
> is
> getting pretty respectable!

I'm sure Oscar Wilde said something witty and cutting about
respectability, but it has slipped my mind.

>
> The class names in this file still confuse me. Do we have any rules
> on how to build names for classes in inheritance hierarcy? e.g.

I believe Steve McConnell's books (e.g. Code Complete) cover these
naming convention issues. We could look to see what he says.

> FileDBSequence and SequenceFileDB -- which one is the sequence db,
> which one is the sequence??

I guess I'm just following English, e.g. "sequence database" is a
database; vs. "database sequence" is a sequence. If someone has a
better convention, please describe it.

>
>
> The invert behavior is confusing (__invert__ points to
> classutil.standard_invert which then redirects to self._inverse,
> creating it from self._inverseClass if necessary). Can we use
> descriptors or something "simpler" here?

Please elaborate further; I don't understand what you're suggesting.
The Python standard defines __invert__ as the invert operator method
name; I follow that standard. How can a descriptor help us here?

>
>
> Have we decided anything about logging, as Istvan established in his
> branch? I'd like to get rid of explicit writes to sys.stderr, in
> particular.

Seems like a worthy goal. Has anyone entered an issue for this in the
tracker? I'm not aware that anyone has explicitly proposed this for
0.8.

>
>
> I'd like to establish some documentation or tests on exactly what
> db.seqInfoDict should do. So far I know that it should contain info
> objects with a length attribute, for NLMSAs to use; and an id and
> sequence, presumably. What else? Ultimately it'd be good to
> establish a simple executable test for "proper" seqInfoDict behavior.

Hmm... I certainly don't envision *requiring* a "sequence" attribute.
I made seqInfoDict a standard mechanism for getting information about
a specific sequence ID without having to do the full set of operations
for instantiating a sequence object (which, depending on the backend
implementation, might involve actually retrieving the sequence, and
thus could be slow). The only seqInfoDict value attribute currently
required in Pygr is the length. However you and Alex Nolley pointed
out the value of being able to associate arbitrary attributes with
sequence objects (initially, the FASTA description line), so I
formulated seqInfoDict as a general-purpose mechanism for such
attributes.

I suggest that the list of attributes required on all seqInfoDict
values be minimal -- only what Pygr core code simply can't live
without. (e.g. in the case of NLMSA, it must know the length in order
to pack the coordinate system). Of course, users' code can look for
other seqInfoDict attributes if they need to. We should just define
consistent names, so that different backends will use the same
attribute name for a given kind of data (e.g. free-form text
description like the FASTA description line could be called
"description"). But unless there is a compelling reason why Pygr's
core code needs the "description", I don't think we should add a test
that asserts the absence of that attribute is a bug.

So the executable test you're proposing only needs to look for the
"length" attribute.

>
>
> Specific issues
> ---------------
>
> I don't like some aspects of the 'ifile' handling in SequenceFileDB.
> In particular, I dislike the part where it re-opens the file to
> re-read it and then closes 'ifile'; this may not be expected behavior
> ('ifile' itself is only used to grab the filename!) and I don't like
> closing user-passed-in file handles. Plus, pulling the filename out
> of a file handle disallows StringIO and other "file-like" handles such
> as sockets... Is the ifile mechanism actually used anywhere?

This was used by some coordinator module functionality that is made
obsolete by the much more powerful capabilities of pygr.Data. We can
certainly get rid of this ifile mechanism.

>
>
> The 'set_seqtype' function on SequenceDB depends on things being a
> FASTA file
> instead of just iterating. Why?? Can we remove that?

The reason this exists is historical. At one point Namshin reported
that opening a sequence database had become very slow, and it was due
to the set_seqtype() method invoking the iterator instead of our
original read_fasta_one_line(), which I had removed during our seqdb
refactoring. The seqdb iterator invoked the Python shelve __iter__(),
which turned out to load the entire list of keys into memory before
returning the iterator; very slow for solexa data! To get a fix to
Namshin quickly, I first restored the use of read_fasta_one_line,
which solved the problem. I later was able to solve the shelve
problem; because we had previously found btree to be more scalable
than the default hash-based shelve, we had created our own subclass of
shelve. So I just added a proper __iter__() method to our subclass.

Conclusion: we could again get rid of read_fasta_one_line().

>
>
> Also, 'set_seqtype' returns the sequence type as well as setting it on
> the db object. This behavior isn't used anywhere in the code base.
> Can
> we trash it?

Sounds fine to me.

>
>
> In fact, it seems like 'set_seqtype' is used only within SequenceDB
> and its children. Why not rename it to be private, i.e. prepend it
> with '_'?

Fine by me. Actually, this seems like a topic ripe for code review.
Python is not that zealous about the whole public vs. private
distinction, and I've accordingly been lax about this. We could go
through the code and designate all purely internal methods, then
prefix them with _ .

>
>
> Should SequenceDB._cacheMax be configurable? Right now it's
> hardcoded.

Sure, why not.

>
>
> I don't know what idFilter does in _store_seqlen_dict, and it seems
> like it would be easier to just implement that in your own
> seqfmt-based reader, anyway. Can we trash it? It's not used anywhere
> in the code base.

idFilter addressed the nasty reality that one major database source
(NCBI) abuses their ID field (treating it as a blob into which they
can cram as many junk fields as they want... and then blastall returns
an ID that does not match the original in the file). idFilter allows
you to do any transformations on the input IDs that you want.
However, as a result of the seqdb refactoring, this may no longer be
needed in the sequence file db class itself. All the code for the
NCBI blast ID munging got moved to blast.py and is handled there. We
could get rid of idFilter in _store_seqlen_dict.

>
>
> The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be
> created directly by a user, right? Should we make them private by
> prepending a '_'?

Throughout pygr we allow users to pass an itemClass to a database
constructor, so the item classes are public. The _init_subclass
pattern tended to modularize all the backend details in the item
class, making the database class very generic (works with any
itemClass). Based on your advice, I've reversed that policy, so that
the backend details are on the database class. But still, I can
easily imagine scenarios in which a user would want to subclass
FileDBSequence, so I don't see these as private.

>
> The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and
> unused in the rest of the code base. I'd appreciate some guidance on
> what they do, and perhaps some tests to get started. (@CTB issue)

I don't think anyone has used these classes in at least a couple of
years. We could delete them.

>
>
> The SQLSequence stuff at the bottom of seqdb.py could probably go in a
> different file, like sqlgraph.py. They don't depend on anything in
> seqdb.py.

That seems like a reasonable idea.

I'll answer the PrefixUnionDict questions in a separate email.

Yours with thanks,

Chris

Alexander Alekseyenko

unread,

Mar 18, 2009, 2:21:51 PM3/18/09

to pygr-dev

Hope it's not too late for this release, but I have one comment.
BlastDB method blast can only process one sequence at a time, which
can be slow due to numerous system calls when blasting a large
database against some other database. I have uploaded to this group my
implementation of processing a whole database in just one invocation
of .blast method (see roche454.tgz:roche.py: blast_processor,
process_library_blast and ReferenceDB.blast_library). The
implementation creates a blast process in a thread that listens to
fasta sequences being piped into it, feeds them into blast and writes
blast results into return pipe. Do you think this can be included in
the release?

Cheers,
Alex

C. Titus Brown

unread,

Mar 18, 2009, 2:23:19 PM3/18/09

to pygr...@googlegroups.com

On Wed, Mar 18, 2009 at 11:21:51AM -0700, Alexander Alekseyenko wrote:
-> Hope it's not too late for this release, but I have one comment.
-> BlastDB method blast can only process one sequence at a time, which
-> can be slow due to numerous system calls when blasting a large
-> database against some other database. I have uploaded to this group my
-> implementation of processing a whole database in just one invocation
-> of .blast method (see roche454.tgz:roche.py: blast_processor,
-> process_library_blast and ReferenceDB.blast_library). The
-> implementation creates a blast process in a thread that listens to
-> fasta sequences being piped into it, feeds them into blast and writes
-> blast results into return pipe. Do you think this can be included in
-> the release?

(...without looking at the code or understanding the issues in
detail...)

I think we might as well try. I will put this on my list of things to
look at this weekend.

thanks,
--titus

C. Titus Brown

unread,

Mar 18, 2009, 3:34:10 PM3/18/09

to pygr...@googlegroups.com

On Wed, Mar 18, 2009 at 10:51:36AM -0700, Christopher Lee wrote:
-> Thanks for all your detailed analysis, Titus! Comments below...

Welcome!

-> > The class names in this file still confuse me. Do we have any rules
-> > on how to build names for classes in inheritance hierarcy? e.g.
->
-> I believe Steve McConnell's books (e.g. Code Complete) cover these
-> naming convention issues. We could look to see what he says.
->
-> > FileDBSequence and SequenceFileDB -- which one is the sequence db,
-> > which one is the sequence??
->
-> I guess I'm just following English, e.g. "sequence database" is a
-> database; vs. "database sequence" is a sequence. If someone has a
-> better convention, please describe it.

So this is more of a pragmatic issue for me than a style question: I
just get confused when reading those names in code. At this point I've
mentally over-analyzed the situation so I can't think clearly about it
any more.

I think FileDB_Sequence and Sequence_FileDB would be clearer to me,
emphasizing that the adjectival associations are (FileDB)+Sequence and
Sequence+(FileDB). But those are ugly names so I don't want to do that.

Is this a mental problem unique to me? Or does anyone else get
confused?

-> > The invert behavior is confusing (__invert__ points to
-> > classutil.standard_invert which then redirects to self._inverse,
-> > creating it from self._inverseClass if necessary). Can we use
-> > descriptors or something "simpler" here?
->
-> Please elaborate further; I don't understand what you're suggesting.
-> The Python standard defines __invert__ as the invert operator method
-> name; I follow that standard. How can a descriptor help us here?

It's not __invert__ itself, it's the optimization that constructs
__invert__ only when needed. I don't think

class SomeClass(...):
__invert__ = classutil.standard_invert
...
_inverseClass = FooBar

is an obvious construct, although I do understand why you defined things
that way. How about,

__invert__ = classutil.lazy_create(FooBar)

instead, for example? Then FooBar would be obviously and directly
connected to __invert__.

-> > Have we decided anything about logging, as Istvan established in his
-> > branch? I'd like to get rid of explicit writes to sys.stderr, in
-> > particular.
->
-> Seems like a worthy goal. Has anyone entered an issue for this in the
-> tracker? I'm not aware that anyone has explicitly proposed this for
-> 0.8.

Istvan? How much work would this be, do you think?

Maybe we could at least put the mechanisms in, and then convert the code
base slowly?

-> > I'd like to establish some documentation or tests on exactly what
-> > db.seqInfoDict should do. So far I know that it should contain info
-> > objects with a length attribute, for NLMSAs to use; and an id and
-> > sequence, presumably. What else? Ultimately it'd be good to
-> > establish a simple executable test for "proper" seqInfoDict behavior.

[ ... info on seqInfoDict ... ]

OK, I'll write this up in some nice way in the module itself, and then
you can check it over. I think I understand seqInfoDict now, at any
rate ;)

-> So the executable test you're proposing only needs to look for the
-> "length" attribute.

[ ... ]

OK. I'll just put such a statement into the documentation on building
your own seqdb implementation, for now.

-> > I don't like some aspects of the 'ifile' handling in SequenceFileDB.
-> > In particular, I dislike the part where it re-opens the file to
-> > re-read it and then closes 'ifile'; this may not be expected behavior
-> > ('ifile' itself is only used to grab the filename!) and I don't like
-> > closing user-passed-in file handles. Plus, pulling the filename out
-> > of a file handle disallows StringIO and other "file-like" handles such
-> > as sockets... Is the ifile mechanism actually used anywhere?
->
-> This was used by some coordinator module functionality that is made
-> obsolete by the much more powerful capabilities of pygr.Data. We can
-> certainly get rid of this ifile mechanism.

OK, that will definitely simplify the code -- fantastic!

-> > The 'set_seqtype' function on SequenceDB depends on things being a
-> > FASTA file
-> > instead of just iterating. Why?? Can we remove that?

[ ... __iter__ problem on bsddb ... ]

-> Conclusion: we could again get rid of read_fasta_one_line().

OK, great!

-> > Also, 'set_seqtype' returns the sequence type as well as setting it on
-> > the db object. This behavior isn't used anywhere in the code base.
-> > Can
-> > we trash it?
->
-> Sounds fine to me.

Great!

-> > In fact, it seems like 'set_seqtype' is used only within SequenceDB
-> > and its children. Why not rename it to be private, i.e. prepend it
-> > with '_'?
->
-> Fine by me. Actually, this seems like a topic ripe for code review.
-> Python is not that zealous about the whole public vs. private
-> distinction, and I've accordingly been lax about this. We could go
-> through the code and designate all purely internal methods, then
-> prefix them with _ .

I'll just assume this as part of the module-level code review, then. It
works well with systematically improving code coverage analysis, as you
can quickly discover where you forgot compensatory _ prefixes when you
have near-100% code coverage...

-> > Should SequenceDB._cacheMax be configurable? Right now it's
-> > hardcoded.
->
-> Sure, why not.

ok

-> > I don't know what idFilter does in _store_seqlen_dict, and it seems
-> > like it would be easier to just implement that in your own
-> > seqfmt-based reader, anyway. Can we trash it? It's not used anywhere
-> > in the code base.
->
-> [ ... ] We could get rid of idFilter in _store_seqlen_dict.

Excellent!

-> > The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be
-> > created directly by a user, right? Should we make them private by
-> > prepending a '_'?
->
-> Throughout pygr we allow users to pass an itemClass to a database
-> constructor, so the item classes are public. The _init_subclass
-> pattern tended to modularize all the backend details in the item
-> class, making the database class very generic (works with any
-> itemClass). Based on your advice, I've reversed that policy, so that
-> the backend details are on the database class. But still, I can
-> easily imagine scenarios in which a user would want to subclass
-> FileDBSequence, so I don't see these as private.

OK, makes sense. I was -0 on doing this and you've explained why ;)

-> > The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and
-> > unused in the rest of the code base. I'd appreciate some guidance on
-> > what they do, and perhaps some tests to get started. (@CTB issue)
->
-> I don't think anyone has used these classes in at least a couple of
-> years. We could delete them.

OK, great!

-> > The SQLSequence stuff at the bottom of seqdb.py could probably go in a
-> > different file, like sqlgraph.py. They don't depend on anything in
-> > seqdb.py.
->
-> That seems like a reasonable idea.

I'll try it out & see what happens...

Istvan Albert

unread,

Mar 18, 2009, 4:06:57 PM3/18/09

to pygr-dev

On Mar 18, 3:34 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> -> > Have we decided anything about logging, as Istvan established in his
> -> > branch? I'd like to get rid of explicit writes to sys.stderr, in
> -> > particular.
> ->
> -> Seems like a worthy goal. Has anyone entered an issue for this in the
> -> tracker? I'm not aware that anyone has explicitly proposed this for
> -> 0.8.
>
> Istvan? How much work would this be, do you think?

There is an issue in the tracker for this:

http://code.google.com/p/pygr/issues/detail?id=61

I'd say that to get it incorporated the logger module now currently
under tests/testlib/logger.py would need to be copied/moved to the
main pygr package, then each sys.stderr message would need be replaced
with a call to logger.debug(), info(), warn() or error().

Probably it's best if only one module is migrated at once, to see how
it works out in practice.
The issue above also mentions the verbosity of these messages that
would be best to be made shorter. It is fairly straightforward to do,
all it needs some on-the-fly decision making on what severity level
corresponds to each of the messages. (severity levels rise in the
order listed above).

The way I typically do it is that I have debug messages be littered
throughout the code but these are suppressed by default and get turned
on only when testing with higher verbosity modes (1 or 2). Info and
above are printed by default.

Istvan

Christopher Lee

unread,

Mar 18, 2009, 4:43:14 PM3/18/09

to pygr...@googlegroups.com

On Mar 18, 2009, at 12:34 PM, C. Titus Brown wrote:
> I don't think
>
> class SomeClass(...):
> __invert__ = classutil.standard_invert
> ...
> _inverseClass = FooBar
>
> is an obvious construct, although I do understand why you defined
> things
> that way. How about,
>
> __invert__ = classutil.lazy_create(FooBar)
>
> instead, for example? Then FooBar would be obviously and directly
> connected to __invert__.

OK, I see how this is clearer in a single line than setting __invert__
and _inverseClass in two consecutive lines in my original code. Your
lazy_create() would return a callable function that behaves like my
existing standard_invert. Do you want to implement this, or should I?

-- Chris

Christopher Lee

unread,

Mar 18, 2009, 4:58:06 PM3/18/09

to pygr...@googlegroups.com

Hi Alex,
how much faster is your library processor than iterating over blasting
all the sequences in the simple-minded approach?

I suspect that incorporating your library blast processor into the 0.8
release may be complicated by the fact that BlastDB itself is now
deprecated. We refactored all the blast support functionality into a
separate module (pygr.blast); BlastDB is retained only for backward
compatibility. I'll have to look at your code to see exactly what
it's assuming and how easily this could be "ported" to our new
BlastMapping construct. I guess your processor could be made into a
BlastMapping method. Then there is the issue of testing -- do you
have (or could you write) a test suite that would test this
thoroughly, that we could incorporate into our standard tests or
megatests (if the tests would take too long or require resources too
big to be incorporated in the standard release package)?

FYI, I have summarized the blast refactoring in the Pygr Dev
discussion group and also somewhat in the issue tracker.
Unfortunately, I haven't updated the regular documentation yet,
because Istvan changed our documentation format to Restructured Text,
and we (I) have to convert all the existing docs (50,000 words worth)
to REST before updating its content. The switch to REST seems like a
great idea, but it has slowed the updating of the docs to reflect all
the 0.8 changes.

-- Chris

Christopher Lee

unread,

Mar 18, 2009, 5:32:10 PM3/18/09

to pygr...@googlegroups.com

On Mar 18, 2009, at 1:06 PM, Istvan Albert wrote:

> There is an issue in the tracker for this:
>
> http://code.google.com/p/pygr/issues/detail?id=61

Great!

> I'd say that to get it incorporated the logger module now currently
> under tests/testlib/logger.py would need to be copied/moved to the
> main pygr package, then each sys.stderr message would need be replaced
> with a call to logger.debug(), info(), warn() or error().

I guess that would mean we'd need a test suite for logger's
functionality, since it's now being incorporated as part of Pygr?

>
>
> Probably it's best if only one module is migrated at once, to see how
> it works out in practice.
> The issue above also mentions the verbosity of these messages that
> would be best to be made shorter. It is fairly straightforward to do,
> all it needs some on-the-fly decision making on what severity level
> corresponds to each of the messages. (severity levels rise in the
> order listed above).

There aren't *too* many cases where we currently print to stdout /
stderr. Changing these to logger calls should not break algorithmic
correctness, so once logger itself is tested, I see no reason not to
make this change in one clean sweep.

Release Management Question: is this a functionality change (and
therefore delays the alpha release), or is this "debugging" that can
occur after the alpha?

> The way I typically do it is that I have debug messages be littered
> throughout the code but these are suppressed by default and get turned
> on only when testing with higher verbosity modes (1 or 2). Info and
> above are printed by default.

I agree, with only one caveat. Unlike a compiled language like C
where debug messages can even be placed deep in inner loops, because
compiling without debug flags takes them out during preprocessing, in
an interpreted language like Python there is a performance hit for
every debug statement (presumably an IF statement has to be executed
for each one). So this becomes unattractive in code where performance
is important. Do you know of any good way around this in Python?

-- Chris

C. Titus Brown

unread,

Mar 18, 2009, 6:00:48 PM3/18/09

to pygr...@googlegroups.com

On Wed, Mar 18, 2009 at 01:43:14PM -0700, Christopher Lee wrote:
->

-> On Mar 18, 2009, at 12:34 PM, C. Titus Brown wrote:
-> > I don't think

-> >
-> > class SomeClass(...):
-> > __invert__ = classutil.standard_invert
-> > ...
-> > _inverseClass = FooBar
-> >
-> > is an obvious construct, although I do understand why you defined
-> > things
-> > that way. How about,
-> >
-> > __invert__ = classutil.lazy_create(FooBar)
-> >
-> > instead, for example? Then FooBar would be obviously and directly
-> > connected to __invert__.
->
-> OK, I see how this is clearer in a single line than setting __invert__
-> and _inverseClass in two consecutive lines in my original code. Your
-> lazy_create() would return a callable function that behaves like my
-> existing standard_invert. Do you want to implement this, or should I?

I'd like to be jealous and keep my metaphysical "lock" on seqdb, if you
don't mind... I'll make the patch discrete & clean and send it on to you
for review when I do it.

So: I will.

Alexander Alekseyenko

unread,

Mar 18, 2009, 6:11:39 PM3/18/09

to pygr-dev

> how much faster is your library processor than iterating over blasting
> all the sequences in the simple-minded approach?

It all depends on the size of the blast database you are comparing you
library to. In my case, I had a longish (1.2kb) sequence that I needed
to compare lots of 454 reads to. For each read inside the blastall
itself the 1.2kb reference sequence needs to be read using fastacmd
which we know is slow, plus there's some initialization going on,which
also requires some time, multiply that by the number of reads (on the
order of 10^5) and you can see that there's a lot of unnecessary work
being performed.

> Then there is the issue of testing -- do you
> have (or could you write) a test suite that would test this
> thoroughly, that we could incorporate into our standard tests or
> megatests (if the tests would take too long or require resources too
> big to be incorporated in the standard release package)?

I have no experience writing test case, but I suppose I could give it
a go early next week. I'm thinking of following the usage example I
described above, i.e. one long sequence as a blast database, plus a
bunch of its sub-sequences with known much coordinates as a library.
This should catch all of the problems I can think of.

Cheers,
Alex

P.S.

> FYI, I have summarized the blast refactoring in the Pygr Dev
> discussion group and also somewhat in the issue tracker.

Sorry, I fell about a year behind on the new going-ons with pygr,
trying to catch up.

Istvan Albert

unread,

Mar 18, 2009, 9:06:10 PM3/18/09

to pygr-dev

On Mar 18, 5:32 pm, Christopher Lee <l...@chem.ucla.edu> wrote:

> I guess that would mean we'd need a test suite for logger's
> functionality, since it's now being incorporated as part of Pygr?

It has a doctest based testsuite that executes when you run the logger
module on its own.

This could be incorporated in the main test runner. On the other hand
all test modules heavily use the logger thus we are already testing
it. If the logger would break the testutil module breaks as well, so I
think this may not need another explicit test.

> Release Management Question: is this a functionality change (and
> therefore delays the alpha release), or is this "debugging" that can
> occur after the alpha?

I think the logger should be added into the pygr package as soon as
possible and new
code should make use of it. Switching all prints to logging calls is
sort of an 'scratch an itch' like task, one evening when the time is
right someone can just swap out the prints to logger calls in 10
minutes. The important thing is to have the logger module present so
that we can actually do it.

> an interpreted language like Python there is a performance hit for
> every debug statement

> Do you know of any good way around this in Python?

Probably there isn't any proper methodology for that.

Technically the assert statement gets removed when python is invoked
with the optimization flag, and one could probably make calls to
logging messages as part of the assertion (ugh), but that is misusing
asserts ... and impractical anyhow

best,

Istvan

C. Titus Brown

unread,

Mar 20, 2009, 9:40:57 PM3/20/09

to pygr...@googlegroups.com

On Wed, Mar 18, 2009 at 10:51:36AM -0700, Christopher Lee wrote:
-> > Specific issues
-> > ---------------
-> >
-> > I don't like some aspects of the 'ifile' handling in SequenceFileDB.
-> > In particular, I dislike the part where it re-opens the file to
-> > re-read it and then closes 'ifile'; this may not be expected behavior
-> > ('ifile' itself is only used to grab the filename!) and I don't like
-> > closing user-passed-in file handles. Plus, pulling the filename out
-> > of a file handle disallows StringIO and other "file-like" handles such
-> > as sockets... Is the ifile mechanism actually used anywhere?
->
-> This was used by some coordinator module functionality that is made
-> obsolete by the much more powerful capabilities of pygr.Data. We can
-> certainly get rid of this ifile mechanism.

Removed.

-> > The 'set_seqtype' function on SequenceDB depends on things being a
-> > FASTA file
-> > instead of just iterating. Why?? Can we remove that?

[ ... ]

-> Conclusion: we could again get rid of read_fasta_one_line().

Removed.

-> > Also, 'set_seqtype' returns the sequence type as well as setting it on
-> > the db object. This behavior isn't used anywhere in the code base.
-> > Can
-> > we trash it?
->
-> Sounds fine to me.

Removed.

-> > In fact, it seems like 'set_seqtype' is used only within SequenceDB
-> > and its children. Why not rename it to be private, i.e. prepend it
-> > with '_'?
->
-> Fine by me.

done.

-> > I don't know what idFilter does in _store_seqlen_dict, and it seems
-> > like it would be easier to just implement that in your own
-> > seqfmt-based reader, anyway. Can we trash it? It's not used anywhere
-> > in the code base.
->

-> [ ... ] We could get rid of idFilter in _store_seqlen_dict.

Done; I removed the 'ifile' mechanism from there, too, since it's no
longer needed.

-> > The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and
-> > unused in the rest of the code base. I'd appreciate some guidance on
-> > what they do, and perhaps some tests to get started. (@CTB issue)
->
-> I don't think anyone has used these classes in at least a couple of
-> years. We could delete them.

Removed.

All changes tested & pushed to my github pygr repo, branch seqdb_review.

C. Titus Brown

unread,

Mar 20, 2009, 10:02:54 PM3/20/09

to pygr...@googlegroups.com

On Wed, Mar 18, 2009 at 01:43:14PM -0700, Christopher Lee wrote:
->

-> On Mar 18, 2009, at 12:34 PM, C. Titus Brown wrote:
-> > I don't think

-> >
-> > class SomeClass(...):
-> > __invert__ = classutil.standard_invert
-> > ...
-> > _inverseClass = FooBar
-> >
-> > is an obvious construct, although I do understand why you defined
-> > things
-> > that way. How about,
-> >
-> > __invert__ = classutil.lazy_create(FooBar)
-> >
-> > instead, for example? Then FooBar would be obviously and directly
-> > connected to __invert__.
->

-> OK, I see how this is clearer in a single line than setting __invert__
-> and _inverseClass in two consecutive lines in my original code. Your
-> lazy_create() would return a callable function that behaves like my
-> existing standard_invert. Do you want to implement this, or should I?

done; now you can do

__invert__ = classutil.lazy_create_invert(_SequenceDBInverse)

in a class definition to get a one-time-only lazily created __invert__
object per parent class. This object is saved in self._inverseObj.

I could have made the function entirely generic, e.g.

__invert__ = classutil.lazy_create(_SequenceDBInverse)

and stored the created object in an isolated and unreachable scope, but
then the object would have been inaccessible for debugging purposes,
too. If it is (or ends up being) a frequent pattern we might want to do
this anyway, but I thought it was a poor tradeoff for now.

Note, I didn't remove standard_invert from throughout the codebase, only
from within seqdb.py. We can change it elsewhere as we do code reviews.

C. Titus Brown

unread,

Mar 20, 2009, 10:10:51 PM3/20/09

to pygr...@googlegroups.com

-> > The SQLSequence stuff at the bottom of seqdb.py could probably go in a
-> > different file, like sqlgraph.py. They don't depend on anything in
-> > seqdb.py.
->
-> That seems like a reasonable idea.

Moved. Works fine!

-> I'll answer the PrefixUnionDict questions in a separate email.

I didn't see anything on this -- no hurry, just wanted to make sure I
didn't miss something in the e-mail flurry ;)

C. Titus Brown

unread,

Mar 20, 2009, 10:34:12 PM3/20/09

to pygr...@googlegroups.com

On Wed, Mar 18, 2009 at 06:06:10PM -0700, Istvan Albert wrote:
-> On Mar 18, 5:32?pm, Christopher Lee <l...@chem.ucla.edu> wrote:
->
-> > I guess that would mean we'd need a test suite for logger's ?
-> > functionality, since it's now being incorporated as part of Pygr?
->
-> It has a doctest based testsuite that executes when you run the logger
-> module on its own.
->
-> This could be incorporated in the main test runner. On the other hand
-> all test modules heavily use the logger thus we are already testing
-> it. If the logger would break the testutil module breaks as well, so I
-> think this may not need another explicit test.

I agree, at least for the moment. The logger module is pretty trivial
and if we use it everywhere we'll quickly find out when it breaks ;)

-> > Release Management Question: is this a functionality change (and ?
-> > therefore delays the alpha release), or is this "debugging" that can ?
-> > occur after the alpha?
->
-> I think the logger should be added into the pygr package as soon as
-> possible and new
-> code should make use of it. Switching all prints to logging calls is
-> sort of an 'scratch an itch' like task, one evening when the time is
-> right someone can just swap out the prints to logger calls in 10
-> minutes. The important thing is to have the logger module present so
-> that we can actually do it.

So, I put this change into my seqdb_review branch and changed seqdb.py
to make use of the new pygr.logger module; you can see the resulting
code changes here,

http://github.com/ctb/pygr/commit/e96845367ecd7c404c26a623be5adb705b0b4ba9

The only problem I ran into is that I would like to turn off the logging
messages from the main code when running the test suite! Right now I've
done that by using 'logger.debug' in seqdb.py; to see how annoying it is
to do anything else, try changing them to 'logger.info' and running the
tests...

Istvan Albert

unread,

Mar 20, 2009, 10:57:16 PM3/20/09

to pygr-dev

On Mar 20, 10:34 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> The only problem I ran into is that I would like to turn off the logging
> messages from the main code when running the test suite!

The way this currently works is that you will start seeing the debug
messages
if the verbosity level passed to runtest.py is 2

runtest.py -v 2

for other verbosity levels the debug level is turned off but the other
levels are always on.

Istvan

PS> So, I put this change into my seqdb_review

hey Titus, you have a seqdb_review and a seqdb-review branch up there,
something gut pushed onto a different branch

C. Titus Brown

unread,

Mar 20, 2009, 11:02:05 PM3/20/09

to pygr...@googlegroups.com

On Fri, Mar 20, 2009 at 07:57:16PM -0700, Istvan Albert wrote:
-> On Mar 20, 10:34?pm, "C. Titus Brown" <c...@msu.edu> wrote:
->
-> > The only problem I ran into is that I would like to turn off the logging
-> > messages from the main code when running the test suite!
->
-> The way this currently works is that you will start seeing the debug
-> messages
-> if the verbosity level passed to runtest.py is 2
->
-> runtest.py -v 2
->
-> for other verbosity levels the debug level is turned off but the other
-> levels are always on.

Yes, but I think the info/warn/error messages from the test code should
be printed under different circumstances than the messages from the pygr
code.

-> PS> So, I put this change into my seqdb_review
->
-> hey Titus, you have a seqdb_review and a seqdb-review branch up there,
-> something gut pushed onto a different branch

Yeah, I know. it's seqdb_review that you should look at; I'll delete
the other one.

--t

Istvan Albert

unread,

Mar 20, 2009, 11:11:14 PM3/20/09

to pygr-dev

On Mar 20, 11:02 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> Yes, but I think the info/warn/error messages from the test code should
> be printed under different circumstances than the messages from the pygr
> code.

Not sure what you mean here. Do you mean creating a custom log level
like TEST in between DEBUG (10) and INFO (20) ?

C. Titus Brown

unread,

Mar 20, 2009, 11:16:05 PM3/20/09

to pygr...@googlegroups.com

On Fri, Mar 20, 2009 at 08:11:14PM -0700, Istvan Albert wrote:
-> On Mar 20, 11:02?pm, "C. Titus Brown" <c...@msu.edu> wrote:
->
-> > Yes, but I think the info/warn/error messages from the test code should
-> > be printed under different circumstances than the messages from the pygr
-> > code.
->
-> Not sure what you mean here. Do you mean creating a custom log level
-> like TEST in between DEBUG (10) and INFO (20) ?

No, I mean redirecting the pygr.logger stuff while the tests themselves
are running. This would probably mean reintroducing a separate test
logging module.

I'll play around with it next time it irritates me and see if I can come
up with a clean solution.

Istvan Albert

unread,

Mar 20, 2009, 11:28:38 PM3/20/09

to pygr-dev

On Mar 20, 11:16 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> I'll play around with it next time it irritates me and see if I can come
> up with a clean solution.

I see, I think I know what you mean.

The text runner itself prints dots or other information, and that can
awkwardly intermingle with the output from the logger, ends up
unsightly. This is a pet peeve of mine too ( in case this is you are
referring to).

One solution that I briefly entertained is to rewrite (provide my own)
TextTestRunner class, but was never sufficiently mad to actually do
it.

Istvan

C. Titus Brown

unread,

Mar 20, 2009, 11:30:11 PM3/20/09

to pygr...@googlegroups.com

On Fri, Mar 20, 2009 at 08:28:38PM -0700, Istvan Albert wrote:
-> On Mar 20, 11:16?pm, "C. Titus Brown" <c...@msu.edu> wrote:
->
-> > I'll play around with it next time it irritates me and see if I can come
-> > up with a clean solution.
->
-> I see, I think I know what you mean.
->
-> The text runner itself prints dots or other information, and that can
-> awkwardly intermingle with the output from the logger, ends up
-> unsightly. This is a pet peeve of mine too ( in case this is you are
-> referring to).

Exactly.

-> One solution that I briefly entertained is to rewrite (provide my own)
-> TextTestRunner class, but was never sufficiently mad to actually do
-> it.

That way lies madness and/or nose.

;)

--titus

Alexander Alekseyenko

unread,

Mar 24, 2009, 1:44:27 PM3/24/09

to pygr-dev

Hi Chris,

The testing for blast_library can follow the test for a single
sequence blast in blast_test.py. Here's the necessary diff, assuming
that blast_library is added as a class member to BlastDB

87,88c87
< for id,s in sp.iteritems(): # FOR EVERY SEQUENCE IN
SWISSPROT
< sp.blast(s,msa,expmax=1e-10, verbose=False) #
GET STRONG HOMOLOGS, SAVE ALIGNMENT IN msa
---
> sp.blast_library(sp,msa,expmax=1e-10, verbose=False)
120,122c119,120
< for id,s in sp.iteritems(): # FOR EVERY SEQUENCE IN SWISSPROT
< sp.blast(s,msa,expmax=1e-10, verbose=False) # GET
STRONG HOMOLOGS, SAVE ALIGNMENT IN msa
< msa.build(saveSeqDict=True) # DONE CONSTRUCTING THE ALIGNMENT,
SO BUILD THE ALIGNMENT DB INDEXES
---
> sp.blast_library(sp,msa,expmax=1e-10, verbose=False) # GET STRONG HOMOLOGS, SAVE ALIGNMENT IN msa
> msa.build(saveSeqDict=True) # DONE CONSTRUCTING THE ALIGNMENT, SO BUILD THE ALIGNMENT DB INDEXES
138c136
<
---
>

Christopher Lee

unread,

Mar 25, 2009, 4:43:07 PM3/25/09

to pygr...@googlegroups.com

Hi Titus,
one question on the usage of the variable name 'id': since 'id' is
also the name of a Python built-in function, for the last year or so I
have been trying to eliminate usage of 'id' as a variable name in
Pygr, since that usage would block access to the built-in function of
the same name. (By contrast, using 'id' as an attribute name seems
fine, because that doesn't interfere with access to the built-in
function).

I noticed that in seqdb-review you reverted some of my changes, e.g.
you changed 'seqID' back to 'id'. I prefer 'seqID', both because it
states specifically what the variable is, and because of the concern
stated above. I propose that we be consistent throughout Pygr in
naming such variables 'someID', i.e. with a meaningful prefix that
tells you what kind of identifier it is, and with ID in upper-case.
That's the pattern I am now following generally.

What do you think?

-- Chris

C. Titus Brown

unread,

Mar 25, 2009, 4:44:39 PM3/25/09

to pygr...@googlegroups.com

On Wed, Mar 25, 2009 at 01:43:07PM -0700, Christopher Lee wrote:
-> one question on the usage of the variable name 'id': since 'id' is
-> also the name of a Python built-in function, for the last year or so I
-> have been trying to eliminate usage of 'id' as a variable name in
-> Pygr, since that usage would block access to the built-in function of
-> the same name. (By contrast, using 'id' as an attribute name seems
-> fine, because that doesn't interfere with access to the built-in
-> function).
->
-> I noticed that in seqdb-review you reverted some of my changes, e.g.
-> you changed 'seqID' back to 'id'. I prefer 'seqID', both because it
-> states specifically what the variable is, and because of the concern
-> stated above. I propose that we be consistent throughout Pygr in
-> naming such variables 'someID', i.e. with a meaningful prefix that
-> tells you what kind of identifier it is, and with ID in upper-case.
-> That's the pattern I am now following generally.

sure -- I thought you'd standardized on 'id' and was moving the other
direction ;)

Christopher Lee

unread,

Mar 25, 2009, 5:46:23 PM3/25/09

to pygr...@googlegroups.com

Hi Titus,
regarding PrefixUnionDict.__contains__, I guess we should get rid of
the behavior of returning True for sequence keys that are from the
member dictionaries. That behavior is now supplied by the
_PrefixUnionDictInverse.

What do you think? The challenge will be to identify what if any code
relies on the old behavior... grrr. It's quite possible that no
existing code uses that behavior, but checking that will take some work.

-- Chris

Christopher Lee

unread,

Mar 25, 2009, 6:00:32 PM3/25/09

to pygr...@googlegroups.com

Hi Titus,
another policy question for you: should inverse mapping objects like
_SequenceDBInverse provide a full dictionary interface, e.g. using
DictMixin as we have done elsewhere?

-- Chris

C. Titus Brown

unread,

Mar 25, 2009, 9:32:06 PM3/25/09

to pygr...@googlegroups.com

On Wed, Mar 25, 2009 at 03:00:32PM -0700, Christopher Lee wrote:
-> another policy question for you: should inverse mapping objects like
-> _SequenceDBInverse provide a full dictionary interface, e.g. using
-> DictMixin as we have done elsewhere?

Hmm, probably. It's pretty easy to implement... sorry I missed that,
I'm not very conversant with __invert__ behavior ;)

C. Titus Brown

unread,

Mar 25, 2009, 9:33:05 PM3/25/09

to pygr...@googlegroups.com

On Wed, Mar 25, 2009 at 02:46:23PM -0700, Christopher Lee wrote:
-> regarding PrefixUnionDict.__contains__, I guess we should get rid of
-> the behavior of returning True for sequence keys that are from the
-> member dictionaries. That behavior is now supplied by the
-> _PrefixUnionDictInverse.
->
-> What do you think? The challenge will be to identify what if any code
-> relies on the old behavior... grrr. It's quite possible that no
-> existing code uses that behavior, but checking that will take some work.

How about deprecating it -- raising a warning in 0.8, perhaps? -- and
then removing it in 0.9?

Or, we could remove it in 0.8 after we get the megatests working....

Christopher Lee

unread,

Mar 25, 2009, 11:57:23 PM3/25/09

to pygr...@googlegroups.com

Hi,
I've finished merging the seqdb_review branch to master and pushed it
to the public git repository. It was quite a bit of work and taught
us some lessons about how to do code review, which I will try to
summarize on a wiki page.

Many thanks to Titus, who has done an incredible amount of work to
clean up and improve the seqdb module!

We have now completed a series of major refactor / reintegrate cycles
for 0.8 using our various git branches:
- Istvan's reworking of the test framework
- Istvan's replacement of the documentation framework
- Titus code review and cleanup of seqdb
- my pygr.Data / metabase refactoring
- Marek's & my updating of the megatests

Taken together, these position us very well for the upcoming 0.8
release. We can release the 0.8 alpha (function freeze) within a week
I think. Congratulations and thanks to everyone who has helped to
make all this happen!

Next we need to update the docs to reflect everything new / changed in
0.8!

Yours,

Chris

C. Titus Brown

unread,

Mar 26, 2009, 7:25:33 AM3/26/09

to pygr...@googlegroups.com

On Wed, Mar 25, 2009 at 08:57:23PM -0700, Christopher Lee wrote:
-> I've finished merging the seqdb_review branch to master and pushed it
-> to the public git repository.