Stealing, err, "borrowing" from sqlalchemy

1 view
Skip to first unread message

C. Titus Brown

unread,
Jan 4, 2009, 6:11:57 PM1/4/09
to pygr...@googlegroups.com
Hey all,

in order to support more databases than MySQL, e.g. sqlite and postgres,
we need to add in code for slurping in database schemas from those
databases. This is one of the cases where there is no SQL standard
and everyone does it completely differently -- sqlite has .schema, for
example.

I would like to propose incorporating code from SQLAlchemy (henceforth
'sa'),

http://www.sqlalchemy.org/

for this purpose. SQLAlchemy already supports a wide range of
databases, and it's under the MIT License so we can borrow without
restraint.

Specifically, we might want to borrow --

- the sa URL opening code
- the sa schema reading code

sa is a very nice, general object-relational mapper, so it might be
worth looking into it as a general solution for O/R mapping in pygr, too
-- why maintain our own code base? Performance would be the main
potential blocker; I don't know how fast sa is.

Comments or thoughts?

cheers,
--titus
--
C. Titus Brown, c...@msu.edu

Istvan Albert

unread,
Jan 6, 2009, 12:58:54 PM1/6/09
to pygr-dev


On Jan 4, 6:11 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> Specifically, we might want to borrow --
>
>   - the sa URL opening code
>   - the sa schema reading code

> Comments or thoughts?

Instead of borrowing the code and subsequently maintaining it, it
might be easier to just redistribute sa with pygr. For simplicity and
to cut down on directory structure, packing it into into a zip file
and add it to the import path.

Istvan

Christopher Lee

unread,
Jan 7, 2009, 2:44:00 PM1/7/09
to pygr...@googlegroups.com
Yesterday Jenny and I discussed a variety of issues about
pygr.sqlgraph and sqlalchemy, which seem relevant to this thread...

-- Chris

Jenny Qian
1/6/09 12:28 PM
Hi Chris, I've been studying the sqlgraph module ...
Christopher Lee
1/6/09 12:28 PM
hi
Jenny Qian
1/6/09 12:29 PM
How do you like Istvan's proposal of making sa part of pygr's
dependency?
Christopher Lee
1/6/09 12:30 PM
I'd have to analyze sa's capabilities very carefully. I suspect it
lacks a lot of what we need, but it would be great if it supplied
everything we needed...
1/6/09 12:30 PM

performance is a major issue
Jenny Qian
1/6/09 12:32 PM
SQLALchemy seems to have more tedious interface than SQLTable.
Christopher Lee
1/6/09 12:34 PM
if you've already studied sa a bit, maybe we should together analyze
whether it will fulfill pygr's needs.
Jenny Qian
1/6/09 12:35 PM
One thing about sqlgraph, is it true is can only access database info,
but not update database tables?
Christopher Lee
1/6/09 12:37 PM
sqlgraph has been used almost exclusively in read-only mode, but I
have a branch where I added INSERT and UPDATE support. We could merge
that to the master branch.
Jenny Qian
1/6/09 12:39 PM
sqlgraph only supports mySQL?
Christopher Lee
1/6/09 12:43 PM
it uses generic SQL, so it should work with any database. It's only
been *tested* with mysql, so there will probably turn out to be a few
places where we need to adjust the SQL to ensure compatibility. The
one hole in the python DB API 2.0 is that it provides no specification
for retrieving schema information. That means there just *isn't* any
generic way to request schema information from a database. This is
very annoying. We'd have to code different methods for getting schema
information from different database platforms such as mysql, postgres,
sqlite etc. But thankfully this is a relatively small and isolated
function, so that's probably not too hard.
Jenny Qian
1/6/09 12:44 PM
sorry, schema means the inter-relations between tables?
Christopher Lee
1/6/09 12:51 PM
For sqlgraph, at a minimum it needs to read all the column names for a
table and automatically determine which is the primary key. But
looking forward to the future, yes, I'd like it to automatically infer
the ER diagram (foreign key relations between tables). Something like
that could greatly facilitate connecting to a new database like
Ensembl! Pygr has some fairly simple code to do this, which I've used
with my lab's databases, but I don't think anyone else has used that
code...
Jenny Qian
1/6/09 12:53 PM
Can you show me the code?
1/6/09 12:56 PM

What I really liked about sqlgraph is the use of python's weakref
module. My understanding is that the combination of python's
weakref.WeakValueDictionary and the SQL select based on the dbID could
be powerful in generally improving the performance. Is that right?
Christopher Lee
1/6/09 12:59 PM
Yes. We've had to address many performance / memory usage issues
because of the very large datasets we work with. Namshin Kim kept
running into problems where the performance was inadequate either due
to accessing data row by row or using up huge amounts of memory
resulting in virtual memory swapping. We solved these problems in
several ways:
1/6/09 1:04 PM

- implicit control of how data is loaded: ordinarily, you don't want
to load an entire table into local memory. So standard iteration
(i.e. __iter__) only gets a list of the primary keys; the actual data
will only be loaded if the user requests the object corresponding to a
given key. And as you pointed out, as soon as a user drops a
reference to an object, the WeakValueDictionary will automatically
flush it from memory, keeping memory usage low. By contrast, if the
user actually wants all the data loaded in local memory, they can call
the items() method, which will load the whole table with a single
SELECT call, for maximum efficiency. Thus, the user has a fairly
intuitive way to choose row-by-row loading (the default) vs. efficient
whole-table loading (using items()).
Jenny Qian
1/6/09 1:07 PM
I found that out by studying the sqlgraph as well. This is very cool.
Christopher Lee
1/6/09 1:09 PM
- clusterKey: many kinds of data have some kind of grouping structure,
for example, in a table storing exon information, we almost never need
to look simultaneously at two exons from different genes. I.e. exons
form a gene structure, and we typically just want all the exons from
one gene at a time. So we can greatly speed up data retrieval by
loading simultaneously all rows that share the same clusterKey value.
If there are an average of 20 exons per gene, that would speed up
loading of data from the server by a factor of 20...
Jenny Qian
1/6/09 1:11 PM
I also read a bit about the sa's way to handle eager/lazy loading, it
is more demanding on user's knowledge/awareness about *performance*.
1/6/09 1:12 PM

and it is less specific to the nature of biological data.
1/6/09 1:23 PM

gsc is trying to lay a sqlalchemy ORM between the solexa database and
the zope website...
1/6/09 1:25 PM

I will read more about the eager/lazy loading of SA. Just to see what
they can offer...
1/6/09 1:26 PM

Currently, we don't have integration between SA and pygr.Data yet,
right?
Christopher Lee
1/6/09 1:28 PM
no, no such integration
Jenny Qian
1/6/09 1:29 PM
Can sqlgraph objects be easily submitted to zope?
1/6/09 1:30 PM

How difficult is it to provide connectives between pygr's ORM and Zope?
Christopher Lee
1/6/09 1:34 PM

zope is python. So presumably zope can be made to work with anything
that has a python interface such as sqlalchemy or pygr...
Jenny Qian
1/6/09 1:50 PM

currently, the integration between sa and zope is already available...
Christopher Lee
1/6/09 1:54 PM
it would be useful to make a table comparing sqlalchemy and sqlgraph
side by side...
Jenny Qian
1/6/09 1:56 PM
you are right. I may try this
1/6/09 2:12 PM

Hi Chris, can you be more specific about what you need, so when I am
studying SA and drawing the comparison tables, I could be more
specific as well?
Christopher Lee
1/6/09 2:13 PM
performance features are an important area; all the performance issues
that we discussed above.
Jenny Qian
1/6/09 2:14 PM
ic
Christopher Lee
1/6/09 2:14 PM
interface issues: sqlgraph follows a strict dict-like interface, with
the primary key as dictionary key. What does sqlalchemy do?
Jenny Qian
1/6/09 2:15 PM
ok
Christopher Lee
1/6/09 2:19 PM
class issues: sqlgraph uses itemClass and itemSliceClass parameters to
let the user control what class will be used to construct row objects,
or slices of row objects (a concept which probably doesn't exist in
sqlalchemy). sqlgraph uses "subclass binding" to customize a generic
row class to efficiently access object attributes. What does
sqlalchemy do?
1/6/09 2:20 PM

pickling issues: can sqlalchemy table or row objects be pickled and
unpickled? Will they work in pygr.Data? (i.e. automatically reconnect
to the correct server, as sqlgraph does?)
Jenny Qian
1/6/09 2:22 PM
*pickling* is for sure an advantage of sqlgraph objects.
1/6/09 2:25 PM

In other words, they don't provide a clean interface as pygr.Data
does. User has to re-run the script to require a database connection
explicitly.
Christopher Lee
1/6/09 2:26 PM
memory usage issues: sqlgraph uses a variety of tricks to minimize
memory usage. For example, the default row object stores each row as
a Python tuple, eliminating the need to store a __dict__ for each row
object. It also provides another mechanism (SQLRow) that only loads
attribute data when users ask for it; this is used for SQLSequence,
which provides virtual sequence access (automatic slicing) without
having to load a whole sequence string into memory. That is crucial
for applications that store an entire genome, chromosome or other
large sequence as a database row...
Jenny Qian
1/6/09 2:27 PM
wow.
Christopher Lee
1/6/09 2:28 PM
we can work together on this comparison, once you put together an
initial list of points...
Jenny Qian
1/6/09 2:28 PM
But sa does seem to have a few advantages as well like simplifying
sophisticated joins and subqueries.
Christopher Lee
1/6/09 2:28 PM
that would be great to learn about
Jenny Qian
1/6/09 2:28 PM
sounds good.
1/6/09 2:30 PM

Is strict dict-like interface really an advantage?
Christopher Lee
1/6/09 2:32 PM
i suspect that pygr.Data compatibility is the most important question
we'll have to look at. An object-relational interface aims to provide
an object interface to persistent data. The purpose of pygr.Data is
to make database persistence totally transparent, at both the schema
and code level. A traditional ORM like sqlalchemy only goes half way
towards this goal of transparent persistence; you can't just say "I
want the data named Bio.Genome.Foo" and start working...
Jenny Qian
1/6/09 2:33 PM
exactly, I totally agree.
Christopher Lee
1/6/09 2:34 PM
Yes, I think a dict-like interface is a natural Python model for any
dataset with a primary key. Python programmers know how to use dict
interfaces; why should they have to learn something more complicated
to do the same thing?
Jenny Qian
1/6/09 2:35 PM
but then you don't always know the dbID when you want to retrieve a
row... Often you don't.
1/6/09 2:36 PM

instead you know the common name for a gene for instance.
Christopher Lee
1/6/09 2:36 PM
In that case SELECT seems like the right interface; again, it's
something everybody knows...
Jenny Qian
1/6/09 2:36 PM
so the where clause will be more useful in practice.
1/6/09 2:36 PM

ic
Christopher Lee
1/6/09 2:37 PM
sqltable has a select method...
Jenny Qian
1/6/09 2:37 PM
still pygr is not supporting 'order by'
1/6/09 2:37 PM

I know.
1/6/09 2:37 PM

the select method in the SQLTable
1/6/09 2:38 PM

We use 'Order by' so often...
Christopher Lee
1/6/09 2:39 PM
actually orderBy is a supported parameter for SQLTable and its
subclasses...
Jenny Qian
1/6/09 2:39 PM
hmm...
Christopher Lee
1/6/09 2:39 PM
added recently, don't remember when
Jenny Qian
1/6/09 2:40 PM
hehe
Christopher Lee
1/6/09 2:40 PM
needs to be tested systematically
Jenny Qian
1/6/09 2:41 PM
why do we need to separate SQLTable and SQLTableBase?
1/6/09 2:42 PM

Are all the tests related to sqlgraph.py in the sqltable_test.py and
graph_test.py?
Christopher Lee
1/6/09 2:45 PM
SQLTableBase is the base class for many flavors of access: SQLTable
(data get loaded as row objects); SQLTableClustered (uses clusterKey
as I described above); SQLTableNoCache (row objects do not load the
data locally, but access it from the database on the fly),
SQLTableMultiNoCache (provides a one-to-many interface suitable for a
foreign key), and a number of others.
Jenny Qian
1/6/09 2:46 PM
right, now I remember again ^.^
Christopher Lee
1/6/09 2:46 PM
sqlsequence_test.py as well
Jenny Qian
1/6/09 2:46 PM
ok

Reply all
Reply to author
Forward
0 new messages