Conclusions from SciPy08 Sprint

Christopher Lee

Aug 21, 2008, 1:25:39 PM

to Pygr Development Group

Hi,
I thought I'd try to summarize the conclusions from yesterday's
discussion.

- ask the Python developers about the solution for the shelve
problems. What is the official "long-term solution"?

- test sqlgraph with sqlite3 to ensure we support that platform

Improved sequence database
- we need to finalize the proposed SequenceDB API (see the wiki page),
then merge the best ideas from the seqdb2 prototype and the
FileDBSequence implementation, and implement them under the new API

- we need to propose an API for quantitative data bound to sequence,
either as another kind of sequence, or as an annotation. We should start
from the core operations that one needs for using this kind of data

- Russell Meches proposed using netcdf for storing quantitative data,
and will do some performance tests for working with variable length
records

- SQLTable, SQLGraph should provide a standard way to control
ordering. This seems like a general recommendation; maybe all
container and mapping classes should have a consistent way to control
the order of iteration.

- we need to fill in some holes in sqlgraph to ensure that all the
kinds of containers and mappings Jenny needs for Ensembl can be
provided by standard components. Only cases that require a join
across multiple tables should require writing a custom class;
everything else should be handled by a standard component.

- we need a final resolution to the "Ensembl Assembly Version"
question. I would prefer to push hard for a definitive statement from
Ensembl about what genome standard file they are using in each case.
They *must* have that information internally, and it is simply
unacceptable that this information is not available programmatically.
Otherwise we're forced to do a full mapping of each Ensembl genome (to
the extent we can reconstruct it from whatever sources Ensembl
provides programmatically) onto the standard genome that everyone else
uses. This is feasible but really suboptimal -- it tends to impose a
heavy burden on us, and on users, who would have to use an extra
mapping layer to connect any results from Ensembl to any analysis done
with standard genomes (such as UCSC alignments).

- we need to write "Developer Guidelines" to mandate how developers
should get code and make their own code accessible, install and build,
file bug reports and track issues. The developer group discussion
list is great, but doesn't solve all problems. We need to establish
some consistency, or we'll waste time trying to figure out why
different developers run into problems that others can't reproduce.

- Windows testing suggests that the problems are in the test setup
itself rather than in Pygr, so we should try to solve these and get
the test suite running fully on Windows.

I'm going to send this right now, though there may be issues I
missed. Please add your thoughts.

-- Chris

C. Titus Brown

Aug 21, 2008, 2:01:21 PM

to pygr...@googlegroups.com

On Thu, Aug 21, 2008 at 10:25:39AM -0700, Christopher Lee wrote:
-> I thought I'd try to summarize the conclusions from yesterday's
-> discussion.

Here are my notes.

--t

pygr-sprint-aug-2008.txt

Jenny Qing Qian

Aug 21, 2008, 2:59:18 PM

to pygr...@googlegroups.com

Unpickling saved resources isn't working too well.

Unpickling of the ensembl annotation database objects is working now. Thanks, Chris!

Whoops, pyrex stuff isn't installed on Jenny's computer :)

In the first iteration of prototyping a Python Ensembl API (before the midterm review of this GSoC project), I tried a conventional Object-Relational Mapping approach; it was messy and tedious, not fun...
In the second iteration, I learned to use standard pygr components (sqlgraph.SQLTable, sqlgraph.TupleO, seqdb.AnnotationDB, sqlgraph.SQLGraph), and it's getting interesting...
In the third iteration, I will try to make the basic Ensembl business objects subclasses of the AnnotationDB class, save Ensembl mapper and graph objects to pygr.Data, and possibly do something that involves pyrex, since it is installed now :)

jenny

Christopher Lee

Aug 21, 2008, 5:50:40 PM

to pygr...@googlegroups.com

On Aug 21, 2008, at 11:01 AM, C. Titus Brown wrote:

> Unpickling saved resources isn't working too well.

The problem had nothing to do with Pygr. But it was interesting, and
might be considered a failure of the Python pickle module to raise an
appropriate warning message. Jenny was trying to test her code using
doctests that are inserted directly in the module file (adaptor.py)
whose classes she was trying to test. She ran the tests by

python adaptor.py

But in this case, note that the module is never *imported*, and Python
assigns each class a __module__ attribute of '__main__' instead of the
actual module name 'adaptor'. Python pickling depends on the module
name for automatically re-importing the module that contains the
necessary class(es) during unpickling. Normally, the pickle module
performs a check that the class is actually found in the specified
module namespace, and raises an exception if not. However, in this
case it raised no error or warning message at all. And of course,
when you try to unpickle the object, it fails with a cryptic error
message, decipherable only by someone immersed in pickling methods.
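
To make the mechanism concrete, here is a minimal sketch (not Pygr code) that
reads back the module name pickle records for an object's class. The Exon
class and the recorded_module() helper are made up for illustration; only
pickle and pickletools from the standard library are used.

```python
import pickle
import pickletools


class Exon(object):
    """Hypothetical stand-in for a class defined in adaptor.py."""

    def __init__(self, exon_id):
        self.exon_id = exon_id


def recorded_module(obj):
    """Return the module name that pickle stores for obj's class.

    Pickle stores classes by reference ("module name"), not by value.
    With protocol 2 that reference is written as a GLOBAL opcode, so the
    first GLOBAL in the stream names the module that will be re-imported
    when the object is unpickled.
    """
    data = pickle.dumps(obj, protocol=2, fix_imports=False)
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == 'GLOBAL':
            # arg looks like "modulename ClassName"
            return arg.split(' ')[0]
    return None

# Illustration only (assumes this file is saved as adaptor.py):
#   run as "python adaptor.py"  -> recorded_module(Exon(1)) is '__main__'
#   after "import adaptor"      -> recorded_module(adaptor.Exon(1)) is 'adaptor'
```

In the first case nothing in the pickle mentions 'adaptor' at all, so the
unpickler has no way to know which module it should import.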

I guess this is another example of a "bug" that's actually in the
testing setup, rather than in the code to be tested. It's a good
thing I didn't spend a bunch of time reading all her code over the web
to debug why her classes couldn't be pickled -- there never was
anything wrong with them! The only way to debug the problem was to
see exactly HOW she ran the test... which is different from how I
normally run tests, and thus never would have occurred to me.

Workarounds:
- to avoid this problem, write a separate script that imports the
module to be tested, and invokes the doctests on that module
- I added a check in pygr.Data's pickler subclass to catch this
situation and print an error message explaining what the user must do
to fix the problem. This addresses the fact that Python pickle fails
to give any kind of warning about this case. I also added a test to
the test suite to verify that our check detects this problem.
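
The first workaround could look roughly like the sketch below: a small
separate runner that imports the module by name, so that its classes get
the real module name, and then runs the doctests on it. The function name
run_module_doctests() and the script name in the comment are assumptions
for illustration, not part of Pygr.

```python
import doctest
import importlib


def run_module_doctests(module_name):
    """Import module_name and run its doctests.

    Because the module is imported rather than executed as a script, its
    classes get __module__ == module_name instead of '__main__', so
    anything the doctests pickle can be unpickled later.

    Returns a (failed, attempted) tuple.
    """
    module = importlib.import_module(module_name)
    results = doctest.testmod(module)
    return results.failed, results.attempted

# Hypothetical usage, e.g. from a file named run_doctests.py placed
# next to adaptor.py:
#   failed, attempted = run_module_doctests('adaptor')
```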

Further comments on whether the Python pickler should trap this as an
error, or at least provide a warning:

Strictly speaking, it's not *always* an error to pickle a class whose
__module__ is '__main__'. It's conceivable that the user will
guarantee that the class will already be loaded in __main__ on the
receiving side, in which case unpickling will succeed (you could argue
that it doesn't actually unpickle the class in this case; it just
finds it already present in memory). But note that this weird usage
short-circuits the key feature of unpickling, i.e. that the unpickler
automatically finds and imports the right classes for you.

I'd guess that in 99% of real usage, this condition is simply an error
and will cause unpickling to fail, baffling the user. In the context
of pygr.Data (which is supposed to retrieve your data for you, without
you having to do anything else), this condition is *always* an error.

I think the pickle module should at least output a warning message
explaining the problem, which you could suppress by passing a
verbose=False argument. Or perhaps, it should (by default) raise an
exception, unless you explicitly set an option to permit this unusual
pickling scenario (e.g. allow__main__=True).
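
As a rough sketch of the "raise by default, opt out explicitly" behavior,
here is a Pickler subclass that performs the check. This is not the actual
pygr.Data code, and allow_main is not a real option of the standard pickle
module; GuardedPickler, guarded_dumps() and allow_main are names invented
for this example.

```python
import io
import pickle


class GuardedPickler(pickle.Pickler):
    """Pickler that refuses objects whose class lives in '__main__'."""

    def __init__(self, file, protocol=2, allow_main=False):
        pickle.Pickler.__init__(self, file, protocol)
        self.allow_main = allow_main  # hypothetical opt-out flag

    def persistent_id(self, obj):
        # pickle calls this hook for every object it is about to save,
        # which makes it a convenient place for the check.
        cls = obj if isinstance(obj, type) else type(obj)
        if not self.allow_main and cls.__module__ == '__main__':
            raise pickle.PicklingError(
                "class %s was defined in '__main__'; it can only be "
                "unpickled if it is imported from a real module. Run your "
                "code via 'import', not as a script." % cls.__name__)
        return None  # None means: pickle this object the normal way


def guarded_dumps(obj, allow_main=False):
    """Pickle obj to bytes, applying the '__main__' check."""
    buf = io.BytesIO()
    GuardedPickler(buf, allow_main=allow_main).dump(obj)
    return buf.getvalue()
```

With allow_main=True the check is skipped and pickling behaves like plain
pickle, which covers the unusual case where the receiving side guarantees
that the class is already present in __main__.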

Titus, do you think it would make sense to pass this question on to
the Python folks? I haven't found discussion among the Python dev
people about this, although I see other people on the web running into
the same problem...

-- Chris

Jenny Qing Qian

Aug 25, 2008, 5:36:44 AM

to pygr...@googlegroups.com

I tried to save and retrieve ensembl mappers and graphs to pygr.Data and it works! This is sooo cool because now users don't have to know the mappers and graphs and they don't even need to know pygr.Data!!!

qing@1[ensembl]$ python
Python 2.5.2 (r252:60911, Aug 8 2008, 09:22:44)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from ensembl.adaptor import *
>>> serverRegistry = get_registry(host='ensembldb.ensembl.org', user='anonymous')
>>> coreDBAdaptor = serverRegistry.get_DBAdaptor('homo_sapiens', 'core', '47_36i')
>>> transcriptAdaptor = coreDBAdaptor.get_adaptor('transcript')
>>> exonAdaptor = coreDBAdaptor.get_adaptor('exon')

>>> exonAnnoDB = coreDBAdaptor._get_annotationDB('exon', exonAdaptor)

>>> exon = exonAnnoDB[73777] # get an exon annotation
>>> transcripts = exon.ensemblTranscripts # get all the transcripts to which the given exon belongs
>>> len(transcripts)
1
>>> for t in transcripts:
...     print t.id, repr(t.sequence), len(t.sequence)
...
12511 -chr10[311431:725518] 414087

>>> transcriptAnnoDB = coreDBAdaptor._get_annotationDB('transcript', transcriptAdaptor)

>>> transcript = transcriptAnnoDB[12511] # get a transcript annotation
>>> exons = transcript.ensemblExons # get all the exons of the given transcript
>>> len(exons)
37
>>> for e in exons:
...     print e.id, len(e.sequence)
...
73665 85
73667 72
73682 111
73699 126
73711 210
73722 135
73734 120
73758 198
73777 92
73791 111
73805 124
73821 110
73839 103
73854 65
73869 94
73886 120
73909 115
73926 140
73945 137
73961 209
73977 115
73992 202
74007 110
74021 81
74032 124
74050 122
74065 112
74080 110
74095 131
74118 169
74135 171
74154 62
74171 58
74190 75
74206 175
74223 124
74236 2086

--jenny
